Abstract

Let $(V, A)$ be a weighted graph with a finite vertex set $V$, with a symmetric matrix of nonnegative weights $A$ and with Laplacian $\Delta$. Let $S_* : V \times V \mapsto \mathbb{R}$ be a symmetric kernel defined on the vertex set $V$. Consider $n$ i.i.d. observations $(X_j, X_j', Y_j), j = 1, \dots, n$, where $X_j, X_j'$ are independent random vertices sampled from the uniform distribution in $V$ and $Y_j \in \mathbb{R}$ is a real valued response variable such that $\mathbb{E}(Y_j|X_j, X_j') = S_*(X_j, X_j'), j = 1, \dots, n$. The goal is to estimate the kernel $S_*$ based on the data $(X_1, X_1', Y_1), \dots, (X_n, X_n', Y_n)$ and under the assumption that $S_*$ is low rank and, at the same time, smooth on the graph (the smoothness being characterized by discrete Sobolev norms defined in terms of the graph Laplacian). We obtain several results for such problems including minimax lower bounds on the $L_2$-error and upper bounds for penalized least squares estimators both with nonconvex and with convex penalties.



1 Introduction 

We study a problem of estimation of a symmetric kernel $S_* : V \times V \mapsto \mathbb{R}$ defined on a large weighted graph with a vertex set $V$ and $m := \mathrm{card}(V)$, based on a finite number of noisy linear measurements of $S_*$. For simplicity, assume that these are the measurements of randomly picked entries of the $m \times m$ matrix $(S_*(u,v))_{u,v \in V}$, which is a standard sampling model in matrix completion. More precisely, let $(X_j, X_j', Y_j), j = 1, \dots, n$ be $n$ independent copies of a random triple $(X, X', Y)$, where $X, X'$ are independent random vertices sampled from the uniform distribution $\Pi$ in $V$ and $Y \in \mathbb{R}$ is a "measurement" of the kernel $S_*$ at a random location $(X, X')$ in the sense that $\mathbb{E}(Y|X, X') = S_*(X, X')$. In what follows, we assume that, for some constant $a > 0$, $|Y| \le a$ a.s., which implies that $|S_*(u,v)| \le a, u, v \in V$. The target kernel $S_*$ is to be estimated based on its i.i.d. measurements $(X_j, X_j', Y_j), j = 1, \dots, n$. We would like to study this problem in the case when the target kernel $S_*$ is, on the one hand, "low rank" (that is, $\mathrm{rank}(S_*)$ is relatively small compared with $m$) and, on the other hand, "smooth" in the sense that its "Sobolev type norm" is not too large. Discrete versions of Sobolev norms can be defined for functions and kernels on weighted graphs in terms of their graph Laplacians. As a typical example, one can consider a problem of learning a binary relationship (say, "similarity") between vertices of the graph (see Koltchinskii and Rangel (2012)). In this case, $(X, X', Y) \in V \times V \times \{-1, 1\}$, $Y = +1$ meaning that the vertices $X, X'$ are "similar" and $Y = -1$ meaning that they are not. The goal is to predict $Y$ for a given couple of vertices $(X, X')$ based on the training data $(X_j, X_j', Y_j), j = 1, \dots, n$. Clearly, the optimal classifier is $\mathrm{sign}(S_*(X, X'))$, where $S_*(X, X') = \mathbb{E}(Y|X, X')$ is the regression function. In the learning theory literature, there have been a number of attempts to develop classification methods based on similarities between the objects (see, e.g., Balcan et al (2008), Maurer (2008), Chen et al (2009)). In problems of this kind, it is of importance to learn kernels suitable for representing and predicting such similarity relationships. This is also important in various classification problems in large complex networks (see, e.g., Leskovec et al (2010)). Our main motivation, however, is mostly theoretical: we would like to explore to what extent taking into account smoothness of the target kernel could improve the existing methods of low rank recovery.

We introduce some notation used throughout the paper. Let $\mathcal{S}_V$ be the linear space of symmetric kernels $S : V \times V \mapsto \mathbb{R}$, $S(u,v) = S(v,u), u, v \in V$ (or, equivalently, symmetric $m \times m$ matrices with real entries). Given $S \in \mathcal{S}_V$, we use the notation $\mathrm{rank}(S)$ for the rank of $S$ and $\mathrm{tr}(S)$ for its trace. For two functions $f, g : V \mapsto \mathbb{R}$, $(f \otimes g)(u, v) := f(u)g(v)$. Suppose that $S = \sum_{j=1}^r \mu_j (\psi_j \otimes \psi_j)$ is the spectral representation of $S$ with $r = \mathrm{rank}(S)$, $\mu_1, \dots, \mu_r$ being non-zero eigenvalues of $S$ repeated with their multiplicities and $\psi_1, \dots, \psi_r$ being the corresponding orthonormal eigenfunctions (obviously, there are multiple choices of $\psi_j$s in the case of repeated eigenvalues). We will define $\mathrm{sign}(S)$ as $\mathrm{sign}(S) := \sum_{j=1}^r \mathrm{sign}(\mu_j)(\psi_j \otimes \psi_j)$ and the support of $S$ as $\mathrm{supp}(S) := \mathrm{l.s.}\{\psi_1, \dots, \psi_r\}$ (see footnote 1).

1 "l.s." means "the linear span".

For $1 \le p < \infty$, define the Schatten $p$-norm of $S$ as

$$\|S\|_p := \bigl(\mathrm{tr}(|S|^p)\bigr)^{1/p} = \Bigl(\sum_{j=1}^r |\mu_j|^p\Bigr)^{1/p},$$

where $|S| := \sqrt{S^2}$. For $p = 1$, $\|\cdot\|_1$ is also called the nuclear norm and, for $p = 2$, $\|\cdot\|_2$ is called the Hilbert-Schmidt or Frobenius norm. This norm is induced by the Hilbert-Schmidt inner product which will be denoted by $\langle\cdot,\cdot\rangle$. The operator norm of $S$ is defined as $\|S\| := \max_j |\mu_j|$ (see footnote 2).

Let $\Pi^2 := \Pi \times \Pi$ be the distribution of the random couple $(X, X')$. The $L_2(\Pi^2)$-norm of kernel $S$,

$$\|S\|^2_{L_2(\Pi^2)} = \int_{V\times V} |S(u,v)|^2\,\Pi^2(du,dv) = \mathbb{E}|S(X,X')|^2,$$

is naturally related to the sampling model studied in the paper and it will be used to measure the estimation error. Denote by $\langle\cdot,\cdot\rangle_{L_2(\Pi^2)}$ the corresponding inner product. Since $\Pi$ is the uniform distribution in $V$, $\|S\|^2_{L_2(\Pi^2)} = m^{-2}\|S\|_2^2$ and $\langle S_1, S_2\rangle_{L_2(\Pi^2)} = m^{-2}\langle S_1, S_2\rangle$. In what follows, it will often be more convenient to use these rescaled versions rather than the actual Hilbert-Schmidt norm or inner product.

We will also denote by $\{e_v : v \in V\}$ the canonical orthonormal basis of the space $\mathbb{R}^V$. Based on this basis, one can construct matrices $E_{u,v} = E_{v,u} = \frac{1}{2}(e_u \otimes e_v + e_v \otimes e_u)$. If $v_1, \dots, v_m$ is an arbitrary ordering of the vertices in $V$, then $\{E_{v_i,v_i} : i = 1, \dots, m\} \cup \{\sqrt{2}E_{v_i,v_j} : 1 \le i < j \le m\}$ is an orthonormal basis of the space $\mathcal{S}_V$ of symmetric matrices with Hilbert-Schmidt inner product.
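The orthonormality claim for this basis can be checked numerically. The following is a minimal sketch in plain Python, assuming vertices are indexed $0, \dots, m-1$; the helper names (`outer`, `hs_inner`, `E`, `symmetric_basis`, `gram_error`) are illustrative and not part of the paper.

```python
from itertools import combinations
import math

def outer(x, y):
    # (x ⊗ y)(u, v) = x(u) y(v)
    return [[xi * yj for yj in y] for xi in x]

def hs_inner(A, B):
    # Hilbert-Schmidt inner product <A, B> = sum_{u,v} A(u,v) B(u,v)
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def E(u, v, m):
    # E_{u,v} = (e_u ⊗ e_v + e_v ⊗ e_u) / 2
    e_u = [1.0 if i == u else 0.0 for i in range(m)]
    e_v = [1.0 if i == v else 0.0 for i in range(m)]
    a, b = outer(e_u, e_v), outer(e_v, e_u)
    return [[0.5 * (a[i][j] + b[i][j]) for j in range(m)] for i in range(m)]

def symmetric_basis(m):
    # {E_{i,i}} together with {sqrt(2) E_{i,j} : i < j}
    diag = [E(i, i, m) for i in range(m)]
    off = [[[math.sqrt(2) * x for x in row] for row in E(i, j, m)]
           for i, j in combinations(range(m), 2)]
    return diag + off

def gram_error(basis):
    # largest deviation of the Gram matrix from the identity
    return max(abs(hs_inner(A, B) - (1.0 if a == b else 0.0))
               for a, A in enumerate(basis) for b, B in enumerate(basis))

basis = symmetric_basis(4)
```

For $m = 4$ the basis has $m(m+1)/2 = 10$ elements, the dimension of the space of symmetric $4 \times 4$ matrices, and `gram_error(basis)` should vanish up to rounding.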

In standard matrix completion problems, $V$ is a finite set with no further structure (that is, the set of edges of the graph or the weight matrix are not specified). In the noiseless matrix completion problems, the target matrix $S_*$ is to be recovered from the measurements $(X_j, X_j', Y_j), j = 1, \dots, n$, where $Y_j = S_*(X_j, X_j')$. The following method is based on nuclear norm minimization over the space of all matrices that "agree" with the data:

$$\hat S := \mathrm{argmin}\bigl\{\|S\|_1 : S \in \mathcal{S}_V,\ S(X_j, X_j') = Y_j,\ j = 1, \dots, n\bigr\}. \tag{1.1}$$

It has been studied in detail in the recent literature, see Candes and Recht (2009), Recht, Fazel and Parrilo (2010), Candes and Tao (2010), Gross (2011) and references therein.
2 With some abuse of notation, we also denote occasionally the canonical Euclidean inner product in $\mathbb{R}^V$ by $\langle\cdot,\cdot\rangle$ and the corresponding Euclidean norm by $\|\cdot\|$.

Clearly, there are low rank matrices that cannot be recovered based on a random sample of $n$ entries unless $n$ is comparable with the total number of the entries of the matrix. For instance, for given $u, v \in V$, let $S_* = E_{u,v}$. Then, $\mathrm{rank}(S_*) \le 2$. However, the probability that the only two non-zero entries of $S_*$ are not present in the sample is $\bigl(1 - \frac{2}{m^2}\bigr)^n$, and it is close to 1 when $n = o(m^2)$. In this case, the matrix $S_*$ cannot be recovered. So called low coherence assumptions have been developed to define classes of "generic" matrices that are not "low rank" and "sparse" at the same time and for which noiseless low rank recovery is possible with a relatively small number of measurements. For a linear subspace $L \subset \mathbb{R}^V$, let $L^\perp$ be the orthogonal complement of $L$ and let $P_L$ be the orthogonal projector onto the subspace $L$. Denote $L := \mathrm{supp}(S_*)$, $r = \mathrm{rank}(S_*)$. A coherence coefficient is a constant $\nu \ge 1$ such that

$$\|P_L e_v\|^2 \le \frac{\nu r}{m},\ v \in V \quad\text{and}\quad |\langle \mathrm{sign}(S_*)e_u, e_v\rangle|^2 \le \frac{\nu r}{m^2},\ u, v \in V \tag{1.2}$$

(it is easy to see that $\nu$ cannot be smaller than 1).
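To make the non-recoverability example $S_* = E_{u,v}$ (with $u \ne v$) concrete, the sketch below evaluates the probability $(1 - 2/m^2)^n$ that neither of the two non-zero entries is ever sampled. The function name is illustrative and the chosen values of $m$ and $n$ are arbitrary.

```python
import math

def prob_both_entries_missed(m, n):
    # Each draw (X_j, X_j') is uniform over the m*m entries and hits
    # (u, v) or (v, u) with probability 2 / m^2, so all n independent
    # draws miss both entries with probability (1 - 2/m^2)^n.
    return (1.0 - 2.0 / m**2) ** n

m = 1000
p_small_sample = prob_both_entries_missed(m, 10_000)   # n much smaller than m^2
p_large_sample = prob_both_entries_missed(m, m * m)    # n comparable with m^2
```

With $n \ll m^2$ the probability stays close to 1, while for $n = m^2$ it is close to $e^{-2}$, in line with the remark that $n$ must be comparable with the number of entries.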

The following highly nontrivial result is essentially due to Candes and Tao (2010) (a version stated here is due to Gross (2011) and it is an improvement of the initial result of Candes and Tao). It shows that target matrices of "low coherence" (for which $\nu$ is a relatively small constant) can be recovered exactly using the nuclear norm minimization algorithm (1.1) provided that the number of observed entries is of the order $mr$ (up to a log factor).

Theorem 1 Suppose conditions (1.2) hold for some $\nu \ge 1$. Then, there exists a numerical constant $C > 0$ such that, for all $n \ge C\nu rm\log^2 m$, $\hat S = S_*$ with probability at least $1 - m^{-2}$.

In the case of noisy matrix completion, a matrix version of LASSO is based on a trade-off between fitting the target matrix to the data using least squares and minimizing the nuclear norm:

$$\hat S := \mathrm{argmin}_{S\in\mathcal{S}_V}\Bigl[n^{-1}\sum_{j=1}^n \bigl(Y_j - S(X_j, X_j')\bigr)^2 + \varepsilon\|S\|_1\Bigr]. \tag{1.3}$$

This method has been studied by a number of authors, including Candes and Plan (2011), Rohde and Tsybakov (2011), Negahban and Wainwright (2010), Koltchinskii, Lounici and Tsybakov (2011), Koltchinskii (2011b). In the case of known design distribution $\Pi$ (in particular, in the case of uniform design) one can use instead of (1.3) the following modification of nuclear norm penalized least squares method:

$$\hat S := \mathrm{argmin}_{S\in\mathcal{S}_V}\Bigl[\|S\|^2_{L_2(\Pi^2)} - \frac{2}{n}\sum_{j=1}^n Y_j S(X_j, X_j') + \varepsilon\|S\|_1\Bigr]. \tag{1.4}$$

Note that, if the norm $\|S\|_{L_2(\Pi^2)}$ in (1.4) is replaced by the $L_2(\Pi_n)$-norm, where $\Pi_n$ is the empirical distribution based on $(X_1, X_1'), \dots, (X_n, X_n')$, then the resulting estimator coincides with (1.3).

The next result was proved by Koltchinskii, Lounici and Tsybakov (2011) (see their 
Theorem 4). 

Theorem 2 Assume that, for some constant $a > 0$, $|Y| \le a$ a.s. Let $t > 0$ and suppose
that

e > jg( J t + log ^ \/ fiMM) 
Then, there exists a constant $C > 0$ such that with probability at least $1 - e^{-t}$

$$\|\hat S - S_*\|^2_{L_2(\Pi^2)} \le \inf_{S\in\mathcal{S}_V}\Bigl[\|S - S_*\|^2_{L_2(\Pi^2)} + Cm^2\varepsilon^2\,\mathrm{rank}(S)\Bigr].$$

In particular, Theorem 2 implies that, with probability at least $1 - e^{-t}$,

$$\|\hat S - S_*\|^2_{L_2(\Pi^2)} \le Cm^2\varepsilon^2\,\mathrm{rank}(S_*).$$

Very recently, Klopp (2012) proved a similar bound for the matrix LASSO estimator (1.3) in the case when the domain of optimization problem is $\{S : \|S\|_{L_\infty} \le a\}$, where $\|S\|_{L_\infty} := \max_{u,v\in V}|S(u,v)|$ (see footnote 3).

In the current paper, we are more interested in the case when the target kernel $S_*$ is defined on the set $V$ of vertices of a weighted graph with a weight matrix $A$. This allows one to define the notion of graph Laplacian and to introduce Sobolev type norms characterizing smoothness of functions on $V$ as well as symmetric kernels on $V \times V$.

3 In fact, Koltchinskii, Lounici and Tsybakov (2011) and Klopp (2012) studied low rank recovery problems for rectangular matrices. However, modification of their results to the case of symmetric matrices is straightforward.

Let $G = (V, A)$ be a weighted graph with vertex set $V$ and weight matrix $A$. It is assumed that $A := (a(u,v))_{u,v\in V}$ is a symmetric $m \times m$ matrix with nonnegative entries (or, equivalently, a symmetric kernel on $V$). Denote

$$\deg(u) := \sum_{v\in V} a(u,v),\ u \in V.$$

It is common in graph theory to call $\deg(u)$ the degree of vertex $u$. Let $D$ be the diagonal $m \times m$ matrix (kernel) with the degrees of vertices on the diagonal (it is assumed that the vertices of the graph have been ordered in an arbitrary, but fixed way). The Laplacian of the weighted graph $G$ is defined as $\Delta := D - A$. Denote $\langle\cdot,\cdot\rangle$ the canonical Euclidean inner product in the $m$-dimensional space $\mathbb{R}^V$ of functions $f : V \mapsto \mathbb{R}$ and let $\|\cdot\|$ be the corresponding norm. It is easy to see that

$$\langle\Delta f, f\rangle = \frac{1}{2}\sum_{u,v\in V} a(u,v)\bigl(f(u) - f(v)\bigr)^2,$$

implying that $\Delta : \mathbb{R}^V \mapsto \mathbb{R}^V$ is a symmetric nonnegatively definite linear transformation. In a special case of a usual graph $(V, E)$ with vertex set $V$ and edge set $E$, one defines $A(u,v) = 1$ iff $u \sim v$ (that is, vertices $u$ and $v$ are connected with an edge) and $A(u,v) = 0$ otherwise. In this case, $\deg(u)$ is the number of edges incident to the vertex $u$ and

$$\langle\Delta f, f\rangle = \sum_{u\sim v}\bigl(f(u) - f(v)\bigr)^2.$$
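The identity between the quadratic form of the Laplacian and the weighted sum of squared differences can be verified on a small example. This is a minimal sketch in plain Python; the weight matrix and the function $f$ below are arbitrary illustrative choices, not data from the paper.

```python
def laplacian(A):
    # Laplacian D - A, where D is the diagonal matrix of degrees
    m = len(A)
    deg = [sum(row) for row in A]
    return [[(deg[u] if u == v else 0.0) - A[u][v] for v in range(m)]
            for u in range(m)]

def quad_form(L, f):
    # <L f, f> = sum_{u,v} f(u) L(u,v) f(v)
    m = len(f)
    return sum(f[u] * L[u][v] * f[v] for u in range(m) for v in range(m))

def dirichlet_energy(A, f):
    # (1/2) sum_{u,v} a(u,v) (f(u) - f(v))^2
    m = len(f)
    return 0.5 * sum(A[u][v] * (f[u] - f[v]) ** 2
                     for u in range(m) for v in range(m))

# symmetric nonnegative weights on 4 vertices (illustrative)
A = [[0.0, 2.0, 0.0, 1.0],
     [2.0, 0.0, 3.0, 0.0],
     [0.0, 3.0, 0.0, 0.5],
     [1.0, 0.0, 0.5, 0.0]]
f = [1.0, -2.0, 0.5, 3.0]
energy_via_laplacian = quad_form(laplacian(A), f)
energy_via_differences = dirichlet_energy(A, f)
```

The two quantities agree up to rounding, and constant functions give zero energy, which reflects the fact that the Laplacian is nonnegatively definite with constants in its kernel.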

The notion of graph Laplacian allows one to define Sobolev type norms $\|\Delta^{p/2}f\|, p > 0$ for functions on the vertex set of the graph and, thus, to describe their smoothness on the graph. Given a symmetric kernel $S : V \times V \mapsto \mathbb{R}$, one can also describe its smoothness in terms of the norms $\|\Delta^{p/2}S\|_2$. Suppose $S$ has the following spectral representation: $S = \sum_{j=1}^m \mu_j(\psi_j \otimes \psi_j)$, where $\mu_j, j = 1, \dots, m$ are the eigenvalues of $S$ (repeated with their multiplicities) and $\psi_j, j = 1, \dots, m$ are the corresponding orthonormal eigenfunctions in $\mathbb{R}^V$. Then

$$\|\Delta^{p/2}S\|_2^2 = \mathrm{tr}(\Delta^{p/2}S^2\Delta^{p/2}) = \mathrm{tr}(\Delta^p S^2) = \sum_{j=1}^m \mu_j^2\langle\Delta^p\psi_j, \psi_j\rangle = \sum_{j=1}^m \mu_j^2\|\Delta^{p/2}\psi_j\|^2.$$

Basically, it means that the smoothness of the kernel $S$ depends on the smoothness of its eigenfunctions. In what follows, we will often use rescaled versions of Sobolev norms:

$$\|\Delta^{p/2}f\|^2_{L_2(\Pi)} = m^{-1}\|\Delta^{p/2}f\|^2,\quad \|\Delta^{p/2}S\|^2_{L_2(\Pi^2)} = m^{-2}\|\Delta^{p/2}S\|_2^2.$$

It will be convenient for our purposes to fix $p > 0$ and to define a nonnegatively definite symmetric kernel $W := d\Delta^p$, where $d$ is a fixed constant. We will characterize smoothness of a kernel $S \in \mathcal{S}_V$ by the squared Sobolev type norm $\|W^{1/2}S\|^2_{L_2(\Pi^2)}$. The kernel $W$ will be fixed throughout the paper and its spectral properties are crucial in our analysis (see footnote 4). Assume that $W$ has the following spectral representation $W = \sum_{k=1}^m \lambda_k(\phi_k \otimes \phi_k)$, where $0 \le \lambda_1 \le \dots \le \lambda_m$ are the eigenvalues repeated with their multiplicities and $\phi_1, \dots, \phi_m$ are the corresponding orthonormal eigenfunctions (of course, there is a multiple choice of $\phi_k$ in the case of repeated eigenvalues). Let $k_0 := \min\{k \le m : \lambda_k > 0\}$. We will assume in what follows that, for some constant $c \ge 1$, $\lambda_{k+1} \le c\lambda_k$ for all $k \ge k_0$. It will be also convenient to set $\lambda_k := +\infty, k > m$.

4 In fact, the relationship of $W$ to the graph and its Laplacian will be of little importance allowing, possibly, other interpretations of the problem.

Let $\rho := \|W^{1/2}S_*\|_{L_2(\Pi^2)}$ and $r := \mathrm{rank}(S_*)$. It is easy to show (see the proof of Theorem 5 below) that kernel $S_*$ can be approximated by the following kernel $S_{*,l} := \sum_{i,j=1}^l \langle S_*\phi_i, \phi_j\rangle(\phi_i \otimes \phi_j)$ with the approximation error

$$\|S_* - S_{*,l}\|^2_{L_2(\Pi^2)} \le \frac{2\rho^2}{\lambda_{l+1}}. \tag{1.5}$$

Note that the kernel $S_{*,l}$ can be viewed as an $l \times l$ matrix (represented in the basis of eigenfunctions $\{\phi_j\}$) and $\mathrm{rank}(S_{*,l}) \le r \wedge l$, so, one needs $\sim (r \wedge l)l$ parameters to characterize such matrices. Thus, one can expect that such a kernel can be estimated, based on $n$ linear measurements, with the squared $L_2(\Pi^2)$-error of the order $\frac{a^2(r\wedge l)l}{n}$ (see footnote 5). Taking into account the bound on the approximation error (1.5) and optimizing with respect to $l = 1, \dots, m$, it would be also natural to expect the following error rate in the problem of estimation of the target kernel $S_*$:

$$\min_{1\le l\le m}\Bigl[\frac{a^2(r\wedge l)l}{n} \vee \frac{\rho^2}{\lambda_{l+1}}\Bigr]. \tag{1.6}$$
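The trade-off in (1.6) between the "variance" term $a^2(r\wedge l)l/n$ and the "approximation" term $\rho^2/\lambda_{l+1}$ can be explored numerically. The sketch below is only an illustration: the function name `oracle_rate`, the polynomial spectrum $\lambda_l = l^{2\beta}$ and all parameter values are assumptions chosen for the example, and the convention $\lambda_{m+1} = +\infty$ from the text is used for $l = m$.

```python
import math

def oracle_rate(a, r, rho, n, lam):
    # lam[k-1] holds lambda_k for k = 1..m; lambda_{m+1} = +infinity.
    # Returns (min over l of max(variance, approximation), minimizing l).
    m = len(lam)
    best_value, best_l = math.inf, None
    for l in range(1, m + 1):
        variance = a * a * min(r, l) * l / n
        if l == m:
            approximation = 0.0                    # rho^2 / lambda_{m+1} = 0
        elif lam[l] > 0:
            approximation = rho * rho / lam[l]     # lam[l] is lambda_{l+1}
        else:
            approximation = math.inf               # zero eigenvalue
        value = max(variance, approximation)
        if value < best_value:
            best_value, best_l = value, l
    return best_value, best_l

beta = 1.0
lam = [float(k) ** (2 * beta) for k in range(1, 51)]   # lambda_l = l^(2 beta), m = 50
rate, l_star = oracle_rate(a=1.0, r=3, rho=10.0, n=1000, lam=lam)
```

The minimizing $l$ sits near the point where the increasing variance term crosses the decreasing approximation term; the resulting rate never exceeds $a^2 rm/n$ (the choice $l = m$).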



We will show that such a rate is attained (up to constants and log factors) for a version of least squares method with a nonconvex complexity penalty (see Section 3). This method is not computationally tractable, so, we also study another method, based on convex penalization with a combination of nuclear norm and squared Sobolev type norm, and show that the rates are attained for such a method, too, provided that the target matrix satisfies a version of low coherence assumption with respect to the basis of eigenfunctions of $W$ (see Section 4). Finally, we prove minimax lower bounds on the error rate that are roughly of the order

$$\max_{1\le l\le m}\Bigl[\frac{a^2(r\wedge l)l}{n} \wedge \frac{\rho^2}{\lambda_l}\Bigr]$$

(subject to some extra conditions and with additional terms; see Section 2). In typical situations, this expression is of the same order as the upper bound (1.6). For instance, if $\lambda_l \asymp l^{2\beta}$ for some $\beta > 1/2$, then the minimax error rate of estimation of the target kernel $S_*$ is of the order

$$\Bigl(\Bigl(\frac{a^2\rho^{1/\beta}r}{n}\Bigr)^{2\beta/(2\beta+1)} \wedge \Bigl(\frac{a^2\rho^{2/\beta}}{n}\Bigr)^{\beta/(\beta+1)} \wedge \frac{a^2rm}{n}\Bigr) \vee \frac{a^2}{n}$$

(up to log factors). When $m$ is sufficiently large, the term $\frac{a^2rm}{n}$ will be dropped from the minimum and we end up with a nonparametric convergence rate controlled by the smoothness parameter $\beta$ and the rank $r$ of the target matrix $S_*$ (the dependence on $m$ in the first two terms of the minimum is only in the log factors).

5 It is easy to modify the results of the paper and to control the error in terms of variance of the noise of observations $Y_j$ rather than the "range" $a$ of these observations.

The focus of the paper is on the matrix completion problems with uniform random 
design, but it is very straightforward to extend the results of the following sections to 
sampling models with more general design distributions discussed in the literature on low 
rank recovery (such as, for instance, the models of random linear measurements studied 
in Koltchinskii, Lounici and Tsybakov (2011), Koltchinskii (2011b)). It is also not hard 
to replace the range a of the response variable Y by the standard deviation of the noise 
in the upper and lower bounds obtained below. This is often done in the literature on low 
rank recovery and it can be easily extended to the framework discussed in the paper by 
modifying our proofs. We have not discussed this in the paper due to the lack of space. 



2 Minimax Lower Bounds 

In this section, we derive minimax lower bounds on the $L_2(\Pi^2)$-error of an arbitrary estimator $\hat S$ of the target kernel $S_*$ under the assumptions that the response variable $Y$ is bounded by a constant $a > 0$, the rank of $S_*$ is bounded by $r \le m$ and its Sobolev norm $\|W^{1/2}S_*\|_{L_2(\Pi^2)}$ is bounded by $\rho > 0$. More precisely, given $r = 1, \dots, m$ and $\rho > 0$, denote by $\mathcal{S}_{r,\rho}$ the set of all symmetric kernels $S : V \times V \mapsto \mathbb{R}$ such that

(i) $\mathrm{rank}(S) \le r$;

(ii) $\|W^{1/2}S\|_{L_2(\Pi^2)} \le \rho$.

Given $r$, $\rho$ and $a > 0$, let $\mathcal{P}_{r,\rho,a}$ be the set of all probability distributions of $(X, X', Y)$ such that

(i) $(X, X')$ is uniformly distributed in $V \times V$;

(ii) $|Y| \le a$ a.s.;

(iii) $\mathbb{E}(Y|X, X') = S_*(X, X')$, where $S_* \in \mathcal{S}_{r,\rho}$.

For $P \in \mathcal{P}_{r,\rho,a}$, denote $S_P(u,v) := \mathbb{E}_P(Y|X = u, X' = v), u, v \in V$.

Recall that $\{\phi_j, j = 1, \dots, m\}$ are the eigenfunctions of $W$ orthonormal in the space $(\mathbb{R}^V, \langle\cdot,\cdot\rangle)$. Then $\bar\phi_j := \sqrt{m}\phi_j, j = 1, \dots, m$ are orthonormal in $L_2(\Pi)$.
We will obtain minimax lower bounds for classes of distributions $\mathcal{P}_{r,\rho,a}$ in two different cases. In the first case, we assume that for some (relatively large) value of $p \ge 2$ and some (not too large) constant $Q_p > 0$

$$\max_{1\le j\le m}\|\bar\phi_j\|^2_{L_p(\Pi)} \le Q_p. \tag{2.1}$$

Roughly, condition (2.1) means that most of the components of vectors $\phi_j \in \mathbb{R}^V$ are uniformly small, say, $\phi_j(v) \asymp m^{-1/2}, v \in V, j = 1, \dots, m$. In other words, the $m \times m$ matrix $(\phi_j(v))_{j=1,\dots,m;\,v\in V}$ is "dense", so, we refer to this case as a "dense case". The opposite case is when this matrix is "sparse". Suppose, for instance, that for some (relatively small) $d \ge 1$

$$\mathrm{card}\{j : \phi_j(v) \ne 0\} \le d,\ v \in V. \tag{2.2}$$

A typical example is the case when the basis of eigenfunctions $\{\phi_j, j = 1, \dots, m\}$ coincides with the canonical basis $\{e_v : v \in V\}$ of $\mathbb{R}^V$ (then, $d = 1$).

Denote $l_0 := k_0 \wedge 32$. In the dense case, the following theorem holds.

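Theorem 3 Suppose condition (2.1) holds. Define

$$\delta_n^{(1)}(r,\rho,a) := \max_{l_0\le l\le m}\Bigl[\frac{a^2(r\wedge l)l}{n} \wedge \frac{\rho^2}{\lambda_l} \wedge \frac{1}{p-1}\frac{1}{Q_p^2}\frac{a^2(r\wedge l)}{l}\frac{1}{m^{4/p}}\Bigr].$$

There exist constants $c_1, c_2 > 0$ such that

$$\inf_{\hat S_n}\sup_{P\in\mathcal{P}_{r,\rho,a}}\mathbb{P}_P\Bigl\{\|\hat S_n - S_P\|^2_{L_2(\Pi^2)} \ge c_1\delta_n^{(1)}(r,\rho,a)\Bigr\} \ge c_2,$$

where the infimum is taken over all the estimators $\hat S_n$ based on $n$ i.i.d. copies of $(X, X', Y)$.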

In fact, it will follow from the proof that, if $\lambda_{k_0} \le \frac{\rho^2 n}{a^2(r\wedge k_0)k_0}$ (that is, the smallest nonzero eigenvalue of $W$ is not too large), then the maximum in the definition of $\delta_n^{(1)}(r,\rho,a)$ can be extended to all $l = 1, \dots, m$.

Corollary 1 Suppose condition (2.1) holds with $p = \log m$. Let

$$\delta_n^{(2)}(r,\rho,a) := \max_{l_0\le l\le m}\Bigl[\frac{a^2(r\wedge l)l}{n} \wedge \frac{\rho^2}{\lambda_l} \wedge \frac{1}{Q^2_{\log m}}\frac{a^2(r\wedge l)}{l\log m}\Bigr].$$

There exist constants $c_1, c_2 > 0$ such that

$$\inf_{\hat S_n}\sup_{P\in\mathcal{P}_{r,\rho,a}}\mathbb{P}_P\Bigl\{\|\hat S_n - S_P\|^2_{L_2(\Pi^2)} \ge c_1\delta_n^{(2)}(r,\rho,a)\Bigr\} \ge c_2.$$

Proof. Take $p = \log m$ in the statement of Theorem 3 and observe that $m^{4/p} = e^4$ and $\frac{1}{p-1} \ge \frac{1}{\log m}$.



It is obvious that one can replace the quantity $\delta_n^{(1)}(r,\rho,a)$ in Theorem 3 (or the quantity $\delta_n^{(2)}(r,\rho,a)$ in Corollary 1) by the following smaller quantity:

$$\delta_n^{(3)}(r,\rho,a) := \max_{l_0\le l\le L}\Bigl[\frac{a^2(r\wedge l)l}{n} \wedge \frac{\rho^2}{\lambda_l}\Bigr],$$



where L 



A m. Moreover, denote 



I := max< I = Iq, . . . , m : (r V Z) ZA; < 



2 

|0 n 



It is straightforward to check that 



max 

lo<l<m 



a 2 (r A 1)1 * p 



n 



A 



a 2 (r A 1)1 v / p 



77 



V 



a 2 (rAl)l y, p 2 
n V A r+1 • 



and, if Z < L, then (5™ (r, p, a) 

Example. Suppose that, for some $\beta > 1/2$, $\lambda_l \asymp l^{2\beta}, l = 1, \dots, m$ (in particular, it
means that $\lambda_l \ne 0$ and $l_0 = k_0 = 1$). Then, an easy computation shows that



J = (Z Am) VI, Z x ( ^ r ) 



A($) 



^2 n \ 1/(2/3+1) / ^ \ 1/(2/3+2) 

a 2 r/ 



Let p = log m and suppose that maxi<j< m \\<pj \\ 2 Lp ^ — Qp- Take L : 
The condition I < L is satisfied, for instance, when either 

9 \ 1/(2/3+1) 



e 2 Q P \/ lo g( m / e ) 



g 

a 2 r 



< c'n 2 2/3+ 1 , or e 2 Q p y/log(m/e) 



e 2 Q P A/ log(m/e) 



1/08+1) 



/\ m. 



<cn 2 2 ' 9 + 2 . 
(2.3) 

where $c' > 0$ is a small enough constant (this, essentially, means that $n$ is sufficiently
large). Under this condition, we get the following expression for a minimax lower bound:

_2_l//9_\ 2/3/(2/3+1) / 9 2//3\ /3/(/S+l) „2„ rr) \ n 2 



We now turn to the sparse case.
Theorem 4 Suppose condition (2.2) holds and let



S^(r,p,a): 



= max 

lt)<l<m 



a 2 (r A 1)1 a f? a 




There exist constants $c_1, c_2 > 0$ such that




U\S n - Sp\\ 2 L 



L 2 (IP) 



> Cl 



>c 2 . 



It will be clear from the upper bounds of the following sections (see the remark after Theorem 5) that, at least in a special case when $\{\phi_j\}$ coincides with the canonical basis of $\mathbb{R}^V$, the additional third term of the bound of Theorem 4 is correct (up to a log factor). At the same time, most likely, the "third terms" of the bounds of Theorem 3 (in the dense case) and Theorem 4 (in the sparse case) have not reached their final form yet. A more sophisticated construction of "well separated" subsets of $\mathcal{P}_{r,\rho,a}$ might be needed to achieve this goal. The main difficulty in the proof given below is related to the fact that we have to impose constraints, on the one hand, on the entries of the target matrix represented in the canonical basis and, on the other hand, on the Sobolev type norm $\|W^{1/2}S\|_{L_2(\Pi^2)}$ (for which it is convenient to use the representation in the basis of eigenfunctions of $W$). Due to this fact, we are using the last representation in our construction and we have to use an argument based on the properties of Rademacher sums to ensure that the entries of the matrix represented in the canonical basis are uniformly bounded by $a$. This is the reason why the "third terms" occur in the bounds of Theorems 3 and 4. In the case when the constraints are only on the norm $\|W^{1/2}S\|_{L_2(\Pi^2)}$ and on the variance of the noise and there are no constraints on $\|S\|_{L_\infty}$, it is much easier to prove the lower bound without any additional terms. Note, however, that the condition $\|S_*\|_{L_\infty} \le a$ is of importance in the following sections to obtain the upper bounds for penalized least squares estimators that match the lower bounds up to log factors.

Proof of Theorem 3. The proof relies on several well known facts stated below. In what follows, $K(\mu\|\nu) := -\mathbb{E}_\mu\log\frac{d\nu}{d\mu}$ denotes Kullback-Leibler divergence between two probability measures $\mu, \nu$ defined on the same space and such that $\nu \ll \mu$ (that is, $\nu$ is absolutely continuous with respect to $\mu$). We will denote by $P^{\otimes n}$ the $n$-fold product measure $P^{\otimes n} := P \otimes \dots \otimes P$. The following proposition is a version of Theorem 2.5 in Tsybakov (2009).

Proposition 1 Let $\mathcal{P}$ be a finite set of distributions of $(X, X', Y)$ such that the following assumptions hold:

1. There exists $P_0 \in \mathcal{P}$ such that for all $P \in \mathcal{P}$, $P \ll P_0$.

2. There exists $\alpha \in (0, 1/8)$ such that

$$\sum_{P\in\mathcal{P}} K(P_0^{\otimes n}\|P^{\otimes n}) \le \alpha(\mathrm{card}(\mathcal{P}) - 1)\log(\mathrm{card}(\mathcal{P}) - 1).$$

3. For all $P_1, P_2 \in \mathcal{P}$, $\|S_{P_1} - S_{P_2}\|^2_{L_2(\Pi^2)} \ge 4s^2 > 0$.

Then, there exists a constant $\beta > 0$ such that

$$\inf_{\hat S_n}\max_{P\in\mathcal{P}}\mathbb{P}_P\bigl\{\|\hat S_n - S_P\|^2_{L_2(\Pi^2)} \ge s^2\bigr\} \ge \beta > 0. \tag{2.5}$$

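We will also use the Varshamov-Gilbert bound stated below.

Lemma 1 (Varshamov-Gilbert bound) Let $d \ge 8$. There exists a subset $E \subset \{-1, 1\}^d$ such that $\mathrm{card}(E) \ge 2^{d/8} + 1$ and

$$\sum_{i=1}^d I(\sigma_i \ne \sigma_i') \ge \frac{d}{8},\quad \sigma, \sigma' \in E,\ \sigma \ne \sigma'. \tag{2.6}$$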
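For small $d$, a subset with the separation properties of Lemma 1 can be produced by a simple greedy search. The sketch below is only an illustration of what the lemma guarantees, not the argument used to prove it; the function names and the choice $d = 10$ with minimal Hamming distance 3 (stronger than the required $d/8$) are assumptions made for the example.

```python
from itertools import product

def hamming(x, y):
    # number of coordinates in which x and y differ
    return sum(a != b for a, b in zip(x, y))

def greedy_separated_subset(d, min_dist):
    # Scan the cube {-1, 1}^d and keep every point that is at Hamming
    # distance >= min_dist from all points kept so far.
    chosen = []
    for point in product((-1, 1), repeat=d):
        if all(hamming(point, q) >= min_dist for q in chosen):
            chosen.append(point)
    return chosen

d = 10
code = greedy_separated_subset(d, 3)
```

The resulting set has at least $2^{d/8} + 1$ elements and pairwise Hamming distances at least $d/8$, as required in (2.6).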

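Another well known fact we need is Sauer's lemma.

Lemma 2 (Sauer's Lemma) Let $N \ge 1$ and let $A \subset \{-1, 1\}^N$. If

$$\mathrm{card}(A) > \sum_{j=0}^{k-1}\binom{N}{j},$$

then there exists $J \subset \{1, \dots, N\}$ such that $\mathrm{card}(J) = k$ and $\pi_J A = \{-1, 1\}^J$, where $\pi_J : \{-1, 1\}^N \mapsto \{-1, 1\}^J$, $\pi_J(t_1, \dots, t_N) = (t_j : j \in J)$.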

Finally, we use the following elementary bound for Rademacher sums (see de la Pena and Gine (1998), p. 21).

Lemma 3 Let $\varepsilon_1, \dots, \varepsilon_N$ be i.i.d. Rademacher random variables (that is, $\varepsilon_j = +1$ with probability 1/2 and $\varepsilon_j = -1$ with the same probability). Then, for all $p \ge 2$,

$$\mathbb{E}^{1/p}\Bigl|\sum_{j=1}^N \varepsilon_j t_j\Bigr|^p \le \sqrt{p-1}\Bigl(\sum_{j=1}^N t_j^2\Bigr)^{1/2},\quad (t_1, \dots, t_N) \in \mathbb{R}^N.$$
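The moment bound for Rademacher sums can be checked exactly for small $N$ by enumerating all $2^N$ sign patterns. This is a minimal sketch; the function names and the coefficient vector `t` are illustrative choices.

```python
from itertools import product

def rademacher_moment(t, p):
    # exact E|sum_j eps_j t_j|^p, averaging over all 2^N sign patterns
    N = len(t)
    total = sum(abs(sum(e * x for e, x in zip(eps, t))) ** p
                for eps in product((-1, 1), repeat=N))
    return total / 2 ** N

def khintchine_gap(t, p):
    # right hand side minus left hand side of the moment bound
    lhs = rademacher_moment(t, p) ** (1.0 / p)
    rhs = (p - 1) ** 0.5 * sum(x * x for x in t) ** 0.5
    return rhs - lhs

t = [1.0, 0.5, -2.0, 0.25, 1.5]
gaps = {p: khintchine_gap(t, p) for p in (2, 3, 4, 6)}
```

For $p = 2$ the bound holds with equality (the gap is zero up to rounding); for larger $p$ the gap is nonnegative.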



We will start the proof with constructing a "well separated" subset $\mathcal{P}'$ of the class of distributions $\mathcal{P}_{r,\rho,a}$ that will allow us to use Proposition 1. Fix $l \le m$, $l \ge 32$ and $\kappa > 0$. Denote $l' = [l/2]$, $l'' = l - l'$. First assume that $r \le l''$. Denote $R_\sigma := \kappa(\sigma_{ij} : i = 1, \dots, l', j = 1, \dots, r)$, where $\sigma_{ij} = +1$ or $\sigma_{ij} = -1$. Let $\mathcal{R}_{l',r} := \{R_\sigma : \sigma \in \{-1, 1\}^{l'\times r}\}$ (so, $\mathcal{R}_{l',r}$ is the class of all $l' \times r$ matrices with entries $+\kappa$ or $-\kappa$). Given $R \in \mathcal{R}_{l',r}$, let

$$\tilde R := \begin{pmatrix} R & R & \dots & R & O_{l',l^*}\end{pmatrix}$$

be the $l' \times l''$ matrix that consists of $[l''/r]$ blocks $R$ and the last block $O_{l',l^*}$, where $l^* := l'' - [l''/r]r$ and $O_{k_1,k_2}$ is the $k_1 \times k_2$ zero matrix. Finally, define the following symmetric $m \times m$ matrix:

$$\check R := \begin{pmatrix} O_{l',l'} & \tilde R & O_{l',m-l}\\ \tilde R^T & O_{l'',l''} & O_{l'',m-l}\\ O_{m-l,l'} & O_{m-l,l''} & O_{m-l,m-l}\end{pmatrix}.$$

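Now, given $\sigma \in \{-1, 1\}^{l'\times r}$, define a symmetric kernel $K_\sigma : V \times V \mapsto \mathbb{R}$:

$$K_\sigma := \sum_{i,j=1}^m (\check R_\sigma)_{ij}(\phi_i \otimes \phi_j).$$

It is easy to see that

$$K_\sigma(u,v) = \kappa\sum_{i=1}^{l'}\sum_{j=1}^r \sigma_{ij}\phi_i(u)\sum_{k=0}^{[l''/r]-1}\phi_{l'+rk+j}(v) + \kappa\sum_{i=1}^r\sum_{j=1}^{l'}\sigma_{ji}\sum_{k=0}^{[l''/r]-1}\phi_{l'+rk+i}(u)\phi_j(v). \tag{2.7}$$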

Consider the following set: $\Lambda := \{\sigma \in \{-1, 1\}^{l'\times r} : \max_{u,v\in V}|K_\sigma(u,v)| \le a\}$. We will show that, if $\kappa$ is sufficiently small (its precise value to be specified later), then the set $\Lambda$ contains at least three quarters of the points of the combinatorial cube $\{-1, 1\}^{l'\times r}$. To this end, define $\xi := \max_{u,v\in V}|K_\varepsilon(u,v)|$, where $\varepsilon \in \{-1, 1\}^{l'\times r}$ is a random vector with i.i.d. Rademacher components. Assume, in addition, that $\varepsilon$ and $(X, X')$ are independent. It is enough to show that $\xi < a$ with probability at least 3/4. We have

$$\mathbb{P}\{\xi \ge a\} \le \sum_{u,v\in V}\mathbb{P}\{|K_\varepsilon(u,v)| \ge a\} = m^2\,\mathbb{E}\,\mathbb{P}\{|K_\varepsilon(X,X')| \ge a\,|\,X, X'\} = m^2\,\mathbb{P}\{|K_\varepsilon(X,X')| \ge a\} \le \frac{m^2\,\mathbb{E}|K_\varepsilon(X,X')|^p}{a^p}. \tag{2.8}$$

We will use Lemma 3 to control $\mathbb{E}(|K_\varepsilon(X,X')|^p\,|\,X, X')$ (recall that $K_\varepsilon(u,v), u, v \in V$ is a Rademacher sum). By representation (2.7), $K_\varepsilon(u,v) = K'_\varepsilon(u,v) + K'_\varepsilon(v,u)$, where

$$K'_\varepsilon(u,v) := \kappa\sum_{i=1}^{l'}\sum_{j=1}^r \varepsilon_{ij}\phi_i(u)\sum_{k=0}^{[l''/r]-1}\phi_{l'+rk+j}(v).$$

Denote

$$\tau^2(u,v) := \sum_{i=1}^{l'}\sum_{j=1}^r \phi_i^2(u)\Bigl(\sum_{k=0}^{[l''/r]-1}\phi_{l'+rk+j}(v)\Bigr)^2.$$

Observe that

$$\tau^2(u,v) \le \frac{l''}{r}q(l',u)q(l,v) \le q(l,u)q(l,v)\frac{l}{r},$$

where $q(l,u) := \sum_{j=1}^l \phi_j^2(u), u \in V$, and we used the bound

$$\Bigl(\sum_{k=0}^{[l''/r]-1}\phi_{l'+rk+j}(v)\Bigr)^2 \le \frac{l''}{r}\sum_{k=0}^{[l''/r]-1}\phi^2_{l'+rk+j}(v).$$

Thus, applying Lemma 3 to the Rademacher sum $K'_\varepsilon$, we get

$$\mathbb{E}|K_\varepsilon(u,v)|^p \le 2^{p-1}\bigl(\mathbb{E}|K'_\varepsilon(u,v)|^p + \mathbb{E}|K'_\varepsilon(v,u)|^p\bigr) \le 2^p(p-1)^{p/2}\kappa^p\bigl(\tau^2(u,v) \vee \tau^2(v,u)\bigr)^{p/2} \le 2^p(p-1)^{p/2}\kappa^p q^{p/2}(l,u)q^{p/2}(l,v)\Bigl(\frac{l}{r}\Bigr)^{p/2}. \tag{2.9}$$
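Given $p \in [2, +\infty]$, denote

$$Q_p(l) := \Bigl\|\frac{1}{l}\sum_{j=1}^l \bar\phi_j^2\Bigr\|_{L_{p/2}(\Pi)},\quad l = 1, \dots, m.$$

This yields

$$\mathbb{E}|K_\varepsilon(X,X')|^p = \mathbb{E}\,\mathbb{E}\bigl(|K_\varepsilon(X,X')|^p\,|\,X, X'\bigr) \le 2^p(p-1)^{p/2}\kappa^p\Bigl(\frac{l}{r}\Bigr)^{p/2}\mathbb{E}\bigl(q^{p/2}(l,X)q^{p/2}(l,X')\bigr)$$

$$= 2^p(p-1)^{p/2}\kappa^p\Bigl(\frac{l}{r}\Bigr)^{p/2}\bigl(\mathbb{E}q^{p/2}(l,X)\bigr)^2 = 2^p(p-1)^{p/2}\kappa^p\Bigl(\frac{l}{r}\Bigr)^{p/2}\Bigl(\frac{l}{m}\Bigr)^p Q_p^p(l).$$

Substituting the last bound into (2.8), we get

$$\mathbb{P}\{\xi \ge a\} \le \frac{m^2\,\mathbb{E}|K_\varepsilon(X,X')|^p}{a^p} \le m^2 2^p(p-1)^{p/2}\frac{\kappa^p}{a^p}\Bigl(\frac{l}{r}\Bigr)^{p/2}\Bigl(\frac{l}{m}\Bigr)^p Q_p^p(l).$$

Now, to get $\mathbb{P}\{\xi \ge a\} \le 1/4$, it is enough to take

$$\kappa \le 2^{-(1+2/p)}(p-1)^{-1/2}\frac{a}{Q_p(l)}\frac{m}{l}\sqrt{\frac{r}{l}}\frac{1}{m^{2/p}}. \tag{2.10}$$

Next observe that

$$\mathrm{card}(\Lambda) \ge \frac{3}{4}2^{l'r} > \sum_{j=0}^{[l'r/2]}\binom{l'r}{j}.$$

It follows from Sauer's Lemma (see Lemma 2) that there exists a subset $J \subset \{(i,j) : 1 \le i \le l', 1 \le j \le r\}$ with $\mathrm{card}(J) = [l'r/2] + 1$ and such that $\pi_J(\Lambda) = \{-1, 1\}^J$, where $\pi_J : \{-1, 1\}^{l'\times r} \mapsto \{-1, 1\}^J$, $\pi_J(\sigma_{ij} : i = 1, \dots, l', j = 1, \dots, r) = (\sigma_{ij} : (i,j) \in J)$.

Since $l \ge 32$, we have $l'r \ge 16$ and $\mathrm{card}(J) \ge 8$. We can now apply the Varshamov-Gilbert bound (see Lemma 1) to the combinatorial cube $\{-1, 1\}^J$ to prove that there exists a subset $E \subset \{-1, 1\}^J$ such that $\mathrm{card}(E) \ge 2^{l'r/16} + 1$ and, for all $\sigma', \sigma'' \in E, \sigma' \ne \sigma''$, $\sum_{(i,j)\in J} I(\sigma'_{ij} \ne \sigma''_{ij}) \ge \frac{l'r}{16}$. It is now possible to choose a subset $\Lambda'$ of $\Lambda$ such that $\mathrm{card}(\Lambda') = \mathrm{card}(E)$ and $\pi_J(\Lambda') = E$. Then, we have $\mathrm{card}(\Lambda') \ge 2^{l'r/16} + 1$ and

$$\sum_{i=1}^{l'}\sum_{j=1}^r I(\sigma'_{ij} \ne \sigma''_{ij}) \ge \frac{l'r}{16} \tag{2.11}$$

for all $\sigma', \sigma'' \in \Lambda', \sigma' \ne \sigma''$.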

We are now in a position to define the set of distributions $\mathcal{P}'$. For $\sigma \in \Lambda'$, denote by $P_\sigma$ the distribution of $(X, X', Y)$ such that

(i) $(X, X')$ is uniform in $V \times V$;

(ii) conditional distribution of $Y$ given $(X, X')$ is defined as follows:

$$\mathbb{P}_{P_\sigma}\{Y = +a\,|\,X, X'\} = p_\sigma(X,X') = 1/2 + K_\sigma(X,X')/8a,$$

$$\mathbb{P}_{P_\sigma}\{Y = -a\,|\,X, X'\} = 1 - p_\sigma(X,X') = 1/2 - K_\sigma(X,X')/8a.$$

Since $|K_\sigma(X,X')| \le a$ for all $\sigma \in \Lambda'$, we have $p_\sigma(X,X') \in [3/8, 5/8], \sigma \in \Lambda'$. Denote $\mathcal{P}' := \{P_\sigma : \sigma \in \Lambda'\}$. For $P = P_\sigma \in \mathcal{P}'$, we have

$$S_P(u,v) = \mathbb{E}(Y|X = u, X' = v) = \frac{1}{4}K_\sigma(u,v).$$
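As a sanity check on this two-point construction, the sketch below computes the conditional probability of $\{Y = +a\}$ and the conditional mean of $Y$ for a given value of the kernel. The function name `response_law` and the sample values are illustrative assumptions.

```python
def response_law(k_value, a):
    # Two-point law on {+a, -a} with P(Y = +a) = 1/2 + K / (8a),
    # where k_value stands for K_sigma(X, X') with |k_value| <= a.
    p_plus = 0.5 + k_value / (8.0 * a)
    mean = a * p_plus - a * (1.0 - p_plus)   # E(Y | X, X')
    return p_plus, mean

a = 2.0
laws = {k: response_law(k, a) for k in (-a, 0.0, a / 2, a)}
```

For every admissible value of the kernel, the probability stays in $[3/8, 5/8]$ and the conditional mean equals one quarter of the kernel value.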

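Note that $\mathrm{rank}(S_P) = \mathrm{rank}(K_\sigma) = \mathrm{rank}(\check R_\sigma) \le r$ (see the definitions of $K_\sigma$ and $\check R_\sigma$). Moreover, we have

$$\|W^{1/2}K_\sigma\|_2^2 = \Bigl\|W^{1/2}\sum_{i,j=1}^l (\check R_\sigma)_{ij}(\phi_i \otimes \phi_j)\Bigr\|_2^2 = \sum_{i,j=1}^l \lambda_i(\check R_\sigma)_{ij}^2 \le \lambda_l\|K_\sigma\|_2^2$$

and

$$\|K_\sigma\|_2^2 = \Bigl\|\kappa\sum_{i=1}^{l'}\sum_{j=1}^r\sum_{k=0}^{[l''/r]-1}\sigma_{ij}\,\phi_i \otimes \phi_{l'+rk+j} + \kappa\sum_{i=1}^r\sum_{j=1}^{l'}\sum_{k=0}^{[l''/r]-1}\sigma_{ji}\,\phi_{l'+rk+i} \otimes \phi_j\Bigr\|_2^2 = 2\kappa^2 l'r[l''/r] \le \kappa^2 l^2.$$

Therefore, $\|W^{1/2}K_\sigma\|^2_{L_2(\Pi^2)} \le \kappa^2\lambda_l\frac{l^2}{m^2}$, so, we have

$$\|W^{1/2}S_{P_\sigma}\|^2_{L_2(\Pi^2)} = \frac{1}{16}\|W^{1/2}K_\sigma\|^2_{L_2(\Pi^2)} \le \rho^2, \tag{2.12}$$

provided that

$$\kappa \le \frac{m}{l}\frac{4\rho}{\sqrt{\lambda_l}}. \tag{2.13}$$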



We can conclude that, for all $P \in \mathcal{P}'$, $S_P \in \mathcal{S}_{r,\rho}$ provided that $\kappa$ satisfies conditions (2.10) and (2.13). Since also $|Y| \le a$, we have that $\mathcal{P}' \subset \mathcal{P}_{r,\rho,a}$.

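Next we check that $\mathcal{P}'$ satisfies the conditions of Proposition 1. It is easy to see that, for all $\sigma, \sigma' \in \Lambda'$, $P_{\sigma'} \ll P_\sigma$ and

$$K(P_\sigma\|P_{\sigma'}) = \mathbb{E}\Bigl(p_\sigma(X,X')\log\frac{p_\sigma(X,X')}{p_{\sigma'}(X,X')} + (1 - p_\sigma(X,X'))\log\frac{1 - p_\sigma(X,X')}{1 - p_{\sigma'}(X,X')}\Bigr).$$

Using the following elementary inequality $-\log(1+u) \le -u + u^2, |u| \le 1/2$ and the fact that $p_\sigma(X,X') \in [3/8, 5/8], \sigma \in \Lambda'$, we get that

$$K(P_\sigma\|P_{\sigma'}) \le \frac{1}{10a^2}\|K_\sigma - K_{\sigma'}\|^2_{L_2(\Pi^2)} = \frac{1}{10a^2m^2}\|K_\sigma - K_{\sigma'}\|_2^2,\quad \sigma, \sigma' \in \Lambda'.$$

A simple computation based on the definition of $K_\sigma, K_{\sigma'}$ easily yields that

$$\|K_\sigma - K_{\sigma'}\|_2^2 \le 8\kappa^2 l'r[l''/r] \le 8\kappa^2 l'l'' \le 4\kappa^2 l^2.$$

Thus, for the $n$-fold product-measures $P_\sigma^{\otimes n}, P_{\sigma'}^{\otimes n}$, we get

$$K(P_\sigma^{\otimes n}\|P_{\sigma'}^{\otimes n}) = nK(P_\sigma\|P_{\sigma'}) \le \frac{4n\kappa^2 l^2}{10a^2m^2}.$$

For a fixed $\sigma \in \Lambda'$, this yields

$$\frac{1}{\mathrm{card}(\Lambda') - 1}\sum_{\sigma'\in\Lambda'} K(P_\sigma^{\otimes n}\|P_{\sigma'}^{\otimes n}) \le \frac{4n\kappa^2 l^2}{10a^2m^2} \le \frac{1}{10}\frac{l'r}{16} \le \frac{1}{10}\log(\mathrm{card}(\Lambda') - 1), \tag{2.14}$$

provided that

$$\kappa \le \frac{1}{16}a\frac{m}{l}\sqrt{\frac{rl}{n}}. \tag{2.15}$$

It remains to use (2.11) and the definition of kernels $K_\sigma$ to bound from below the squared distance $\|K_\sigma - K_{\sigma'}\|^2_{L_2(\Pi^2)}$ for $\sigma, \sigma' \in \Lambda', \sigma \ne \sigma'$: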



I'r „ 1 o I 2 

K a - K a i\\2 > 4m~*K J 

Since Sp a = jK a , this implies that 



K« ~ KA% m = ™~ 2 \\K« ~ K al \\l > 4m- 2 K 2 -[l"M > -k 2 — 2 . 



2 



H^-^llL ( n 2 )>2- 1 V-L. (2.16) 
In view of (2.10), (2.15) and (2.13), we now take

16 M n ' * Z a/A7 (P j Q P {l)lVlmVp- 

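With this choice of $\kappa$, $\mathcal{P} := \{P_\sigma : \sigma \in \Lambda'\} \subset \mathcal{P}_{r,\rho,a}$. In view of (2.16) and (2.14), we can use Proposition 1 to get

$$\inf_{\hat S}\sup_{P \in \mathcal{P}_{r,\rho,a}} \mathbb{P}_P\left\{\|\hat S - S_P\|^2_{L_2(\Pi^2)} \ge c_1\delta_n\right\} \ge \inf_{\hat S}\sup_{P \in \mathcal{P}} \mathbb{P}_P\left\{\|\hat S - S_P\|^2_{L_2(\Pi^2)} \ge c_1\delta_n\right\} \ge c_2, \tag{2.17}$$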

where 

and $c_1, c_2 > 0$ are constants.

In the case when $r > l''$, bound (2.17) still holds with

a 2 l 2 a p 2 a 1 a 2 1 



A-A J - 



n I \ Xi I ^p- lQ2(/) m 4/ p - 

The proof is an easy modification of the argument in the case when $r < l''$. For $r > l''$,
the construction becomes simpler: namely, we define

O m -l,l' Om-i,I" O m ^i )Trl -i 

where R € Hi' i", and, based on this, redefine kernels K a ,a G {— 1,1}'' X '". The proof 
then goes through with minor simplifications. 

Thus, in both cases $r > l''$ and $r < l''$, (2.17) holds with

a 2 {rM)l a p 2 a 1 1 a 2 (rA/) 1 



n 1 x A/ ' x p — 1 



A z ^p-1Q2(/) / m 4/ p - 
This is true under the assumption that $l \ge 32$. Note also that $Q_p(l) \le \max_{1 \le j \le m}\|\phi_j\|^2_{L_p} \le Q_p$. Thus, we can replace $Q_p^2(l)$ by the upper bound $Q_p^2$ in the definition of $\delta_n(l)$.

We can now choose $l \in \{32, \dots, m\}$ that maximizes $\delta_n(l)$ to get bound (2.17) with
$\delta_n := \min_{32 \le l \le m}\delta_n(l)$. This completes the proof in the case when $k_0 \ge 32$ and $l_0 = 32$.
If $k_0 < 32$, it is easy to use the condition $\lambda_{l+1} \le c\lambda_l$, $l \ge k_0$ and to show that

$$\min_{32 \le l \le m}\delta_n(l) \le d \min_{k_0 \le l \le m}\delta_n(l),$$

where $d$ is a constant depending only on $c$. This completes the proof in the remaining
case.

□ 

Proof of Theorem 4. The only modification of the previous proof is to replace
bound by

Al"M-l .2 [l"M-i 

^ k=0 ' k=0 

Then, the outcome of the next several lines of the proof is that P{£; > a} < 1/4 provided 
that (instead of ([ZIP]) ) 

1 m a 1 



#6 <2-( 1 + 2 /p)( p _ 1 )-i/2. 



Q P (0 I Vdrn^/P' 
As a result, at the end of the proof, we get that (2.17) holds with

a 2 {rM)l A p2 ^ 1 a 2 1 

"~ n[ h ~ n /\\ l /\ p -lQ2(l) dm*/p- 

It remains to observe that Q p (l) < ™, which follows from the fact that 

l l m 

and to take p = log m to complete the proof. 



3 Least Squares Estimators with Nonconvex Penalties 

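In this section, we derive upper bounds on the squared $L_2(\Pi^2)$-error of the following
least squares estimator of the target matrix $S^*$:

$$\hat S_{r,l} := \operatorname*{argmin}_{S \in \tilde{\mathcal{S}}_r(l;a)}\; n^{-1}\sum_{j=1}^n (Y_j - S(X_j, X_j'))^2, \tag{3.1}$$

where $\tilde{\mathcal{S}}_r(l;a) := \{S^a : S \in \mathcal{S}_r(l;a)\}$, $l = 1, \dots, m$,

$$\mathcal{S}_r(l;a) := \Big\{S : S \in \mathcal{S}_V,\ \mathrm{rank}(S) \le r,\ \|S\|_{L_2(\Pi^2)} \le a,\ S = \sum_{i,j=1}^{l}\langle S\phi_i, \phi_j\rangle(\phi_i \otimes \phi_j)\Big\}. \tag{3.2}$$

Here $S^a$ denotes a truncation of kernel $S$: $S^a(u,v) = S(u,v)$ if $|S(u,v)| \le a$, $S^a(u,v) = a$
if $S(u,v) > a$ and $S^a(u,v) = -a$ if $S(u,v) < -a$. Note that the kernels in the class $\mathcal{S}_r(l;a)$
are symmetric and $\mathrm{rank}(S) \le r \wedge l$, $S \in \mathcal{S}_r(l;a)$. Note also that the sets $\mathcal{S}_r(l;a)$, $\tilde{\mathcal{S}}_r(l;a)$
and optimization problem (3.1) are not convex. We will prove the following result under
the assumption that $|Y| \le a$ a.s. Recall the definition of the class of kernels $\mathcal{S}_{r,\rho}$ in
Section 2.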
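The truncation operation described above is simple to state in code. The sketch below is only an illustration, with hypothetical helper names `truncate` and `hs_distance_sq` and kernels stored as plain lists of rows; it clips every entry to $[-a, a]$ and shows on a small example that clipping does not increase the squared Hilbert-Schmidt distance between two kernels.

```python
def truncate(kernel, a):
    """Clip every entry of a kernel (given as a list of rows) to [-a, a]."""
    return [[max(-a, min(a, entry)) for entry in row] for row in kernel]


def hs_distance_sq(k1, k2):
    """Squared Hilbert-Schmidt (Frobenius) distance between two kernels."""
    return sum((x - y) ** 2 for r1, r2 in zip(k1, k2) for x, y in zip(r1, r2))


# Two small symmetric kernels with some entries outside [-1, 1].
S1 = [[0.5, 2.0], [2.0, -3.0]]
S2 = [[0.4, 1.5], [1.5, -0.5]]
a = 1.0

T1, T2 = truncate(S1, a), truncate(S2, a)
print(hs_distance_sq(S1, S2), hs_distance_sq(T1, T2))
```

Clipping is a contraction in each entry, which is why the distance can only decrease; on the other hand it can change the rank of a kernel, so the truncated class is not described by a rank constraint.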

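Theorem 5 There exist constants $C > 0$, $A > 0$ such that, for all $t > 0$, with probability
at least $1 - e^{-t}$,

$$\|\hat S_{r,l} - S^*\|^2_{L_2(\Pi^2)} \le 2\inf_{S \in \tilde{\mathcal{S}}_r(l;a)}\|S - S^*\|^2_{L_2(\Pi^2)} + C\left(\frac{a^2 (r \wedge l) l}{n}\log\left(\frac{Anm}{(r \wedge l) l}\right) + \frac{a^2 t}{n}\right). \tag{3.3}$$

In particular, for some constants $C, A > 0$, for $S^* \in \mathcal{S}_{r,\rho}$ and for all $t > 0$, with probability
at least $1 - e^{-t}$,

$$\|\hat S_{r,l} - S^*\|^2_{L_2(\Pi^2)} \le C\left(\frac{a^2 (r \wedge l) l}{n}\log\left(\frac{Anm}{(r \wedge l) l}\right) \vee \frac{\rho^2}{\lambda_{l+1}} \vee \frac{a^2 t}{n}\right). \tag{3.4}$$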



Proof. Without loss of generality, assume that a = 1; this would imply the general 
case by a simple rescaling of the problem. We will use a version of well known bounds for 
least squares estimators over uniformly bounded function classes in terms of Rademacher 
complexities. Specifically, consider the following least squares estimator: 

$$\hat g := \operatorname*{argmin}_{g \in \mathcal{G}}\; n^{-1}\sum_{j=1}^n (Y_j - g(X_j))^2,$$

where $(X_1, Y_1), \dots, (X_n, Y_n)$ are i.i.d. copies of a random couple $(X, Y)$ in $T \times \mathbb{R}$, $(T, \mathcal{T})$
being a measurable space, $|Y| \le 1$ a.s., $\mathcal{G}$ being a class of measurable functions on
$T$ uniformly bounded by 1. The goal is to estimate the regression function $g^*(x) :=
\mathbb{E}(Y|X = x)$. Define localized Rademacher complexity

$$\psi_n(\delta) := \mathbb{E}\sup_{g_1, g_2 \in \mathcal{G},\ \|g_1 - g_2\|^2_{L_2(\Pi)} \le \delta}|R_n(g_1 - g_2)|,$$

where $\Pi$ is the distribution of $X$ and $R_n(g) := n^{-1}\sum_{j=1}^n \varepsilon_j g(X_j)$ is the Rademacher
process, $\{\varepsilon_j\}$ being a sequence of i.i.d. Rademacher random variables independent of $\{X_j\}$. Denote
$\psi_n^\flat(\delta) := \sup_{\sigma \ge \delta}\frac{\psi_n(\sigma)}{\sigma}$ and $\psi_n^\sharp(\varepsilon) := \inf\{\delta > 0 : \psi_n^\flat(\delta) \le \varepsilon\}$. The next result easily follows
from Theorem 5.2 in Koltchinskii (2011b):

Proposition 2 There exist constants $c_1, c_2 > 0$ such that, for all $t > 0$, with probability
at least $1 - e^{-t}$

$$\|\hat g - g^*\|^2_{L_2(\Pi)} \le 2\inf_{g \in \mathcal{G}}\|g - g^*\|^2_{L_2(\Pi)} + c_1\left(\psi_n^\sharp(c_2) + \frac{t}{n}\right).$$
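As a rough numerical illustration of the Rademacher process $R_n$ defined above, the sketch below estimates $\mathbb{E}\sup_{g_1, g_2}|R_n(g_1 - g_2)|$ for a finite class by Monte Carlo, conditionally on fixed design points. It is only a sketch: the function name `rademacher_sup` and its parameters are hypothetical, and the localization constraint on $\|g_1 - g_2\|_{L_2(\Pi)}$ is omitted for simplicity.

```python
import random


def rademacher_sup(functions, points, num_draws=2000, seed=0):
    """Monte Carlo estimate of E sup_{g1,g2} |n^{-1} sum_j eps_j (g1 - g2)(x_j)|.

    The supremum runs over all pairs from the finite list `functions`,
    and the design points x_1, ..., x_n are held fixed.
    """
    rng = random.Random(seed)
    n = len(points)
    # values[k][j] = g_k(x_j)
    values = [[g(x) for x in points] for g in functions]
    total = 0.0
    for _ in range(num_draws):
        eps = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        sums = [sum(e * v for e, v in zip(eps, row)) / n for row in values]
        # sup over pairs of |R_n(g1) - R_n(g2)| equals max minus min.
        total += max(sums) - min(sums)
    return total / num_draws


# Example: three bounded functions evaluated at ten design points.
xs = [k / 10 for k in range(10)]
gs = [lambda x: x, lambda x: 1.0 - x, lambda x: 0.5]
print(rademacher_sup(gs, xs))
```

For a class with a single element the estimate is exactly zero, and for functions bounded by 1 it never exceeds 2.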

We will apply this proposition to prove Theorem 5. In what follows in the proof,
denote $\hat S := \hat S_{r,l}$. In our case, $T = V \times V$, $(X, X')$ plays the role of $X$ and $\Pi^2$ plays the role
of $\Pi$. Let $\mathcal{G} := \tilde{\mathcal{S}}_r(l; 1)$, $g^* = S^*$ and $\hat g = \hat S$. First, we need to upper bound the Rademacher
complexity $\psi_n(\delta)$ for the class $\mathcal{G}$. Let $\mathbb{S}_{r,m}(R)$ be the set of all symmetric $m \times m$ matrices
$S$ with $\mathrm{rank}(S) \le r$ and $\|S\|_2 \le R$. The $\varepsilon$-covering number $N(\mathbb{S}_{r,m}(R); \|\cdot\|_2; \varepsilon)$ of the
set $\mathbb{S}_{r,m}(R)$ with respect to the Hilbert-Schmidt distance (that is, the minimal number
of balls of radius $\varepsilon$ needed to cover this set) can be bounded as follows:

$$N(\mathbb{S}_{r,m}(R); \|\cdot\|_2; \varepsilon) \le \left(\frac{18R}{\varepsilon}\right)^{(m+1)r}. \tag{3.5}$$

Such bounds are well known (see, e.g., Koltchinskii (2011b), Lemma 9.3 and references
therein; the proof of this lemma can be easily modified to obtain (3.5)). Bound (3.5)
will be used to control the covering numbers of the set of kernels $\mathcal{S}_r(l; 1)$. This set can
be easily identified with a subset of the set $\mathbb{S}_{r \wedge l, l}(m)$ (since kernels $S \in \mathcal{S}_r(l; 1)$ can
be viewed as symmetric $l \times l$ matrices of rank at most $r \wedge l$ with $\|S\|_{L_2(\Pi^2)} \le 1$ and
$\|S\|_2 = m\|S\|_{L_2(\Pi^2)} \le m$). Therefore, we get the following bound:

$$N(\mathcal{S}_r(l; 1); \|\cdot\|_2; \varepsilon) \le \left(\frac{18m}{\varepsilon}\right)^{(l+1)(r \wedge l)}.$$

Since $\|S_1^a - S_2^a\|_2 \le \|S_1 - S_2\|_2$ (truncation of the entries reduces the Hilbert-Schmidt
distance), we also have

$$N(\tilde{\mathcal{S}}_r(l; 1); \|\cdot\|_2; \varepsilon) \le \left(\frac{18m}{\varepsilon}\right)^{(l+1)(r \wedge l)}.$$

Note that

$$\|S_1 - S_2\|^2_{L_2(\Pi_n)} = n^{-1}\sum_{j=1}^n \langle S_1 - S_2, E_{X_j, X_j'}\rangle^2 \le n^{-1}\sum_{j=1}^n \|E_{X_j, X_j'}\|_2^2\, \|S_1 - S_2\|_2^2 \le \|S_1 - S_2\|_2^2.$$

Therefore, we get the following bound on the $L_2(\Pi_n)$-covering numbers of the set $\tilde{\mathcal{S}}_r(l; 1)$:

$$N(\tilde{\mathcal{S}}_r(l; 1); L_2(\Pi_n); \varepsilon) \le \left(\frac{18m}{\varepsilon}\right)^{(l+1)(r \wedge l)}.$$

Here $\Pi_n$ denotes the empirical distribution based on observations $(X_1, X_1'), \dots, (X_n, X_n')$.
The last bound allows us to use inequality (3.17) in Koltchinskii (2011b) to control the
localized Rademacher complexity $\psi_n(\delta)$ of the class $\mathcal{G}$ as follows:



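$$\psi_n(\delta) = \mathbb{E}\sup_{S_1, S_2 \in \tilde{\mathcal{S}}_r(l;1),\ \|S_1 - S_2\|^2_{L_2(\Pi^2)} \le \delta}\left|n^{-1}\sum_{j=1}^n \varepsilon_j\big(S_1(X_j, X_j') - S_2(X_j, X_j')\big)\right| \le C_1\left[\sqrt{\frac{\delta l (r \wedge l)}{n}\log\left(\frac{Am}{\sqrt{\delta}}\right)} \vee \frac{l (r \wedge l)}{n}\log\left(\frac{Am}{\sqrt{\delta}}\right)\right] \tag{3.6}$$

with some constants $A, C_1 > 0$. This easily yields

$$\psi_n^\sharp(c_2) \le C_2\frac{(r \wedge l) l}{n}\log\left(\frac{Anm}{(r \wedge l) l}\right)$$

with some constants $A, C_2 > 0$. Proposition 2 now implies bound (3.3).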
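To prove bound (3.4), it is enough to observe that, for $S^* \in \mathcal{S}_{r,\rho}$,

$$\inf_{S \in \tilde{\mathcal{S}}_r(l;1)}\|S - S^*\|^2_{L_2(\Pi^2)} \le \frac{2\rho^2}{\lambda_{l+1}}. \tag{3.7}$$

Indeed, since $S^* \in \mathcal{S}_{r,\rho}$, we can approximate this kernel by $S_l^* := \sum_{i,j=1}^{l}\langle S^*\phi_i, \phi_j\rangle(\phi_i \otimes \phi_j)$.
For the error of this approximation, we have

$$\|S_l^* - S^*\|^2_{L_2(\Pi^2)} = m^{-2}\|S_l^* - S^*\|_2^2 = m^{-2}\sum_{i \vee j > l}\langle S^*\phi_i, \phi_j\rangle^2$$

$$\le m^{-2}\frac{1}{\lambda_{l+1}}\sum_{i > l}\sum_{j=1}^m \lambda_i\langle S^*\phi_i, \phi_j\rangle^2 + m^{-2}\frac{1}{\lambda_{l+1}}\sum_{i=1}^m\sum_{j > l}\lambda_j\langle S^*\phi_i, \phi_j\rangle^2 \le \frac{2\|W^{1/2}S^*\|^2_{L_2(\Pi^2)}}{\lambda_{l+1}} \le \frac{2\rho^2}{\lambda_{l+1}},$$

which implies $\|(S_l^*)^a - S^*\|^2_{L_2(\Pi^2)} \le \|S_l^* - S^*\|^2_{L_2(\Pi^2)} \le \frac{2\rho^2}{\lambda_{l+1}}$ (since the entries of matrix $S^*$
are bounded by 1 and truncation of the entries reduces the Hilbert-Schmidt distance).
We also have $\mathrm{rank}(S_l^*) \le \mathrm{rank}(S^*) \le r$ and

$$\|S_l^*\|^2_{L_2(\Pi^2)} = m^{-2}\|S_l^*\|_2^2 \le m^{-2}\|S^*\|_2^2 = \|S^*\|^2_{L_2(\Pi^2)} \le \|S^*\|^2_{L_\infty} \le 1.$$

Therefore, $S_l^* \in \mathcal{S}_r(l; 1)$, so that $(S_l^*)^a \in \tilde{\mathcal{S}}_r(l; 1)$, and bound (3.7) follows. Bound (3.4) is a consequence of (3.3)
and (3.7).

□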

Remark. Note that, in the case when the basis of eigenfunctions $\{\phi_j\}$ coincides
with the canonical basis of space $\mathbb{R}^V$, the following bound holds trivially:

II <? <?l|2 4a 2 / 2 2p 2 

\\Si-s*\\l 2 (ip) < -^r + x^l (3 ' 8) 

This follows from the fact that the entries of both matrices Si and Si are bounded 
by a and their nonzero entries are only in the first I rows and the first / columns, so, 
\\S~l ~ *Szll! 2 (n 2 ) — ■ Combining this with (|3.4p and minimizing the resulting bound 
with respect to I yields the following upper bound (up to a constant) that holds for the 
optimal choice of I : 



mm 

KKm 



a?(r A 1)1 ( Anm \ a a 2 l 2 \\i p 2 



I .a/ in \ a Hi \ \ I 1 1 \ I a 2 t 



n \(r A 1)1 J ' ' m 2 J " A^+i 

It is not hard to check that, typically, this expression is of the same order (up to log
factors) as the lower bound of Theorem 4 for $d = 1$.

Next we consider a penalized version of least squares estimator which is adaptive
to unknown parameters of the problem (such as the rank of the target matrix and the
optimal value of parameter $l$ which minimizes the error bound of Theorem 5). We still
assume that $|Y| \le a$ a.s. for some known constant $a > 0$. Define

$$(\hat r, \hat l) := \operatorname*{argmin}_{r, l = 1, \dots, m}\left\{n^{-1}\sum_{j=1}^n (Y_j - \hat S_{r,l}(X_j, X_j'))^2 + K\frac{a^2 (r \wedge l) l}{n}\log\left(\frac{Anm}{(r \wedge l) l}\right)\right\} \tag{3.9}$$

and let $\hat S := \hat S_{\hat r, \hat l}$. Here $K > 0$ and $A > 0$ are fixed constants.
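The selection rule for $(\hat r, \hat l)$ can be sketched as a search over a table of empirical risks. The code below is a minimal illustration under stated assumptions: the penalty is taken to be $K a^2 (r \wedge l) l\, n^{-1} \log(Anm/((r \wedge l) l))$, matching the complexity term of the oracle inequality, the default values of `K` and `A` are arbitrary placeholders, and the function names and the dictionary `empirical_risk` (mapping each pair $(r, l)$ to the mean squared residual of the corresponding least squares fit) are hypothetical.

```python
import math


def penalty(r, l, n, m, a=1.0, K=1.0, A=2.0):
    """Assumed complexity penalty K a^2 (r^l) l / n * log(A n m / ((r^l) l))."""
    d = min(r, l) * l
    return K * a ** 2 * d / n * math.log(A * n * m / d)


def select_rank_and_cutoff(empirical_risk, n, m, a=1.0, K=1.0, A=2.0):
    """Return the pair (r, l) minimizing empirical risk plus penalty.

    `empirical_risk` maps (r, l) to the mean squared residual of the
    least squares fit for that pair; computing the fits is left to the caller.
    """
    return min(
        empirical_risk,
        key=lambda rl: empirical_risk[rl] + penalty(rl[0], rl[1], n, m, a, K, A),
    )


# Toy table of risks: larger models fit better but pay a larger penalty.
risks = {(1, 1): 0.5, (2, 2): 0.1, (3, 3): 0.099}
print(select_rank_and_cutoff(risks, n=100, m=10))
```

In the toy table the pair $(2, 2)$ is selected: moving to $(3, 3)$ lowers the empirical risk only marginally while the penalty roughly doubles.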

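The following theorem provides an oracle inequality for the estimator $\hat S$.

Theorem 6 There exists a choice of constants $K > 0$, $A > 0$ in the definition (3.9) of $(\hat r, \hat l)$ and $C > 0$ in the
inequality below such that for all $t > 0$ with probability at least $1 - e^{-t}$

$$\|\hat S - S^*\|^2_{L_2(\Pi^2)} \le 2\min_{1 \le r \le m,\ 1 \le l \le m}\left[\inf_{S \in \tilde{\mathcal{S}}_r(l;a)}\|S - S^*\|^2_{L_2(\Pi^2)} + C\frac{a^2 (r \wedge l) l}{n}\log\left(\frac{Anm}{(r \wedge l) l}\right)\right] + C\frac{a^2 (t + \log m)}{n}.$$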



Proof. As in the proof of the previous theorem, we can assume that a = 1; the 
general case follows by rescaling. We will use oracle inequalities in abstract penalized 
empirical risk minimization problems (see Koltchinskii (2011b), Theorem 6.5). We only 
sketch the proof here skipping the details that are standard. As in the proof of Theorem
5, first consider i.i.d. copies $(X_1, Y_1), \dots, (X_n, Y_n)$ of a random couple $(X, Y)$ in $T \times \mathbb{R}$,
where $(T, \mathcal{T})$ is a measurable space and $|Y| \le 1$ a.s. Let $\{\mathcal{G}_k : k \in I\}$ be a finite family
of classes of measurable functions from $T$ into $[-1, 1]$. Consider the corresponding family
of least squares estimators

$$\hat g_k := \operatorname*{argmin}_{g \in \mathcal{G}_k}\; n^{-1}\sum_{j=1}^n (Y_j - g(X_j))^2, \quad k \in I.$$

Suppose the following upper bounds on localized Rademacher complexities for classes
$\mathcal{G}_k$, $k \in I$ hold:

$$\mathbb{E}\sup_{g_1, g_2 \in \mathcal{G}_k,\ \|g_1 - g_2\|^2_{L_2(\Pi)} \le \delta}|R_n(g_1 - g_2)| \le \psi_{n,k}(\delta), \quad \delta \ge 0,$$

where $\psi_{n,k}$ are nondecreasing functions of $\delta$ that do not depend on the distribution of
$(X, Y)$. Let

$$\hat k := \operatorname*{argmin}_{k \in I}\left[n^{-1}\sum_{j=1}^n (Y_j - \hat g_k(X_j))^2 + K\left(\psi_{n,k}^\sharp(c_1) + \frac{t_k}{n}\right)\right], \tag{3.10}$$

where $K, c_1$ are constants and $\{t_k, k \in I\}$ are positive numbers. Define the following
penalized least squares estimator of the regression function $g^*$: $\hat g := \hat g_{\hat k}$.

The next result is essentially due to Massart (2000). It can be also deduced from
Theorem 6.5 in Koltchinskii (2011b).

Proposition 3 There exist constants $K, c_1 > 0$ in the definition (3.10) of $\hat k$ and a
constant $K_1 > 0$ such that, for all $t_k > 0$, with probability at least $1 - \sum_{k \in I} e^{-t_k}$

$$\|\hat g - g^*\|^2_{L_2(\Pi)} \le 2\inf_{k \in I}\left[\inf_{g \in \mathcal{G}_k}\|g - g^*\|^2_{L_2(\Pi)} + K_1\left(\psi_{n,k}^\sharp(c_1) + \frac{t_k}{n}\right)\right].$$



We apply this result to the estimator $\hat S = \hat S_{\hat r, \hat l}$, where $(\hat r, \hat l)$ is defined by (3.9) (with
$a = 1$). In this case, $T = V \times V$, $(X, X')$ plays the role of $X$, $g^* = S^*$, $I = \{(r, l) : 1 \le
r, l \le m\}$, $\mathcal{G}_{r,l} = \tilde{\mathcal{S}}_r(l; 1)$. In view of (3.6), we can use the following bounds on localized
Rademacher complexities for these function classes

$$\psi_{n,r,l}(\delta) = C_1\left[\sqrt{\frac{\delta l (r \wedge l)}{n}\log\left(\frac{Am}{\sqrt{\delta}}\right)} \vee \frac{l (r \wedge l)}{n}\log\left(\frac{Am}{\sqrt{\delta}}\right)\right]$$

with some constant $C_1$, and we have

$$\psi_{n,r,l}^\sharp(c_1) \le C_2\frac{(r \wedge l) l}{n}\log\left(\frac{Anm}{(r \wedge l) l}\right)$$

with some constant $C_2 > 0$. Define $t_{r,l} := t + 2\log m$, $(r, l) \in I$. This yields the bound
$\sum_{(r,l) \in I} e^{-t_{r,l}} = e^{-t}$. These considerations and Proposition 3 imply the claim of the
theorem.
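The choice $t_{r,l} := t + 2\log m$ can be verified directly, since the index set $I$ contains exactly $m^2$ pairs $(r, l)$:

$$\sum_{(r,l) \in I} e^{-t_{r,l}} = m^2 e^{-t - 2\log m} = m^2 \cdot m^{-2} e^{-t} = e^{-t}.$$

This is the reason a term of order $\log m$ appears next to $t$ in the oracle inequality.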



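It follows from Theorem 6 that, for some constant $C > 0$ and for all $t > 0$,

$$\sup_{P \in \mathcal{P}_{r,\rho,a}}\mathbb{P}_P\left\{\|\hat S - S_P\|^2_{L_2(\Pi^2)} \ge C\left(\Delta_n(r, \rho, a) \vee \frac{a^2 t}{n}\right)\right\} \le e^{-t}, \tag{3.11}$$

where

$$\Delta_n(r, \rho, a) := \min_{1 \le l \le m}\left[\frac{a^2 (r \wedge l) l}{n}\log\left(\frac{Anm}{(r \wedge l) l}\right) \vee \frac{\rho^2}{\lambda_{l+1}}\right].$$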

Denoting 

r f, / ( Anm \ p 2 n 

I ■- mmj/ = l,...,m: (r V 1)1 \ l+1 log\j-—^ j > 

it is easy to see that 

a , \ a 2 (r A 1)1 ( Anm \ v , p 2 
A n (r,p,a) = — i Mog Vr 

Example. Suppose that, for some $\beta > 1/2$, $\lambda_l \asymp l^{2\beta}$, $l = 1, \dots, m$. Under this
assumption, it is easy to show that the upper bound on the squared $L_2(\Pi^2)$-error of the
estimator $\hat S$ is of the order

ay/'V bg Anm^ W/{W+1) ^^p^HogiAnm)^^ 1 ^ a 2 rmlog(Anm)^ y 

(in fact, the log factors can be written in a slightly better, but more complicated way).
Up to the log factors, this is the same error rate as in the lower bounds of Section 2 (see
(2.5)).



4 Least Squares with Convex Penalization: Combining Nuclear Norm and Squared Sobolev Norm

Our main goal in this section is to study the following penalized least squares estimator
with a combination of two convex penalties:

$$\hat S_{\varepsilon, \bar\varepsilon} := \operatorname*{argmin}_{S \in \mathbb{D}}\left[\frac{1}{n}\sum_{j=1}^n (Y_j - S(X_j, X_j'))^2 + \varepsilon\|S\|_1 + \bar\varepsilon\|W^{1/2}S\|^2_{L_2(\Pi^2)}\right], \tag{4.1}$$

where $\mathbb{D} \subset \mathcal{S}_V$ is a closed convex set of symmetric kernels such that

$$\|S\|_{L_\infty} := \max_{u, v \in V}|S(u, v)| \le a, \quad S \in \mathbb{D},$$

and $\varepsilon, \bar\varepsilon \ge 0$ are regularization parameters. The first penalty involved in (4.1) is based
on the nuclear norm $\|S\|_1$ and it is used to "promote" low rank solutions. The second
penalty is based on a "Sobolev type norm" $\|W^{1/2}S\|^2_{L_2(\Pi^2)}$. It is used to "promote" the
smoothness of the solution on the graph.

We will derive an upper bound on the error $\|\hat S_{\varepsilon, \bar\varepsilon} - S^*\|^2_{L_2(\Pi^2)} = m^{-2}\|\hat S_{\varepsilon, \bar\varepsilon} - S^*\|_2^2$ of
estimator $\hat S_{\varepsilon, \bar\varepsilon}$ in terms of spectral characteristics of the target kernel $S^*$ and matrix $W$.
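The "Sobolev type norm" in the second penalty can be written explicitly in the eigenbasis of $W$. Assuming, as in the spectral representation used in this section, that $W = \sum_{k=1}^m \lambda_k(\phi_k \otimes \phi_k)$ with orthonormal eigenfunctions $\phi_k$, and that $\|\cdot\|_2$ denotes the Hilbert-Schmidt norm, a short computation for a symmetric kernel $S$ gives

$$\|W^{1/2}S\|^2_{L_2(\Pi^2)} = m^{-2}\|W^{1/2}S\|_2^2 = m^{-2}\,\mathrm{tr}(SWS) = m^{-2}\sum_{k=1}^m \lambda_k\|S\phi_k\|^2.$$

The penalty is therefore small when $S$ puts little energy on the eigenfunctions of $W$ with large eigenvalues.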

As before, $W$ is a nonnegatively definite symmetric kernel with spectral representation
$W = \sum_{k=1}^m \lambda_k(\phi_k \otimes \phi_k)$, where $0 \le \lambda_1 \le \dots \le \lambda_m$ are the eigenvalues of $W$
repeated with their multiplicities and $\phi_1, \dots, \phi_m$ are the corresponding orthonormal
eigenfunctions. We will also use the decomposition of identity associated with $W$:

$$E(\lambda) := \sum_{\lambda_j \le \lambda}(\phi_j \otimes \phi_j), \quad \lambda \ge 0.$$

Clearly, $\{E(\lambda), \lambda \ge 0\}$ is a nondecreasing projector-valued function of $\lambda$. Despite the fact
that the eigenfunctions $\{\phi_j\}$ are not uniquely defined in the case when $W$ has multiple
eigenvalues, the decomposition of identity $\{E(\lambda), \lambda \ge 0\}$ is uniquely defined (in fact, it
can be rewritten in terms of spectral projectors of $W$). The distribution of the eigenvalues
of $W$ is characterized by the following spectral function:

$$F(\lambda) := \mathrm{tr}(E(\lambda)) = \|E(\lambda)\|_2^2 = \sum_{j=1}^m I(\lambda_j \le \lambda), \quad \lambda \ge 0.$$

Denote $k_0 := F(0) + 1$ (in other words, $k_0$ is the smallest $k$ such that $\lambda_k > 0$). It was
assumed in the Introduction that there exists a constant $c \ge 1$ such that $\lambda_{k+1} \le c\lambda_k$ for
all $k \ge k_0$.
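The spectral function is just an eigenvalue counting function, so it can be evaluated with a binary search once the eigenvalues are sorted. The following sketch is illustrative only: the name `spectral_function` is hypothetical, and the spectrum used in the example (the Laplacian eigenvalues $0, 1, 3$ of the path graph on three vertices) is an assumed toy choice for $W$.

```python
from bisect import bisect_right


def spectral_function(eigenvalues, lam):
    """F(lam): number of eigenvalues lambda_j <= lam.

    `eigenvalues` must be sorted in nondecreasing order, with repeated
    eigenvalues listed according to their multiplicities.
    """
    return bisect_right(eigenvalues, lam)


# Toy spectrum: Laplacian of the path graph on three vertices.
eigs = [0.0, 1.0, 3.0]

# k0 is the index of the smallest strictly positive eigenvalue.
k0 = spectral_function(eigs, 0.0) + 1
print(k0, [spectral_function(eigs, lam) for lam in (0.0, 0.5, 1.0, 2.5, 3.0)])
```

Here $F$ jumps by the multiplicity at each eigenvalue and equals $m$ for all $\lambda \ge \lambda_m$.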

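Let $\bar F : \mathbb{R}_+ \to \mathbb{R}_+$ be a nondecreasing function such that $F(\lambda) \le \bar F(\lambda)$, $\lambda \ge 0$, the
function $\lambda \mapsto \frac{\bar F(\lambda)}{\lambda}$ is nonincreasing and, for some $\gamma \in (0, 1)$,

$$\int_\lambda^\infty \frac{\bar F(s)}{s^2}\,ds \le \frac{1}{\gamma}\frac{\bar F(\lambda)}{\lambda}, \quad \lambda > 0.$$

Without loss of generality, we assume in what follows that $\bar F(\lambda) = m$, $\lambda \ge \lambda_m$ (otherwise,
one can take the function $\bar F(\lambda) \wedge m$ instead). The conditions on $\bar F$ are satisfied if for some
$\gamma \in (0, 1)$, the function $\frac{\bar F(\lambda)}{\lambda^{1-\gamma}}$ is nonincreasing: in this case, $\frac{\bar F(\lambda)}{\lambda}$ is also nonincreasing and

$$\int_\lambda^\infty \frac{\bar F(s)}{s^2}\,ds = \int_\lambda^\infty \frac{\bar F(s)}{s^{1-\gamma}}\frac{ds}{s^{1+\gamma}} \le \frac{\bar F(\lambda)}{\lambda^{1-\gamma}}\int_\lambda^\infty \frac{ds}{s^{1+\gamma}} = \frac{1}{\gamma}\frac{\bar F(\lambda)}{\lambda}.$$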
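Consider a kernel $S \in \mathcal{S}_V$ (an oracle) with spectral representation: $S = \sum_{k=1}^r \mu_k(\psi_k \otimes
\psi_k)$, where $r = \mathrm{rank}(S) \ge 1$, $\mu_k$ are non-zero eigenvalues of $S$ (possibly repeated)
and $\psi_k$ are the corresponding orthonormal eigenfunctions. Denote $L = \mathrm{supp}(S) =
\mathrm{l.s.}(\psi_1, \dots, \psi_r)$. The following function will be used to characterize the relationship
between the kernels $S$ and $W$:

$$\varphi(S; \lambda) := \langle P_L, E(\lambda)\rangle = \sum_{\lambda_j \le \lambda}\|P_L\phi_j\|^2, \quad \lambda \ge 0. \tag{4.2}$$

It is immediate from this definition that $\varphi(S; \lambda) \le F(\lambda) \le \bar F(\lambda)$, $\lambda \ge 0$. Note also that
$\varphi(S; \lambda) = \sum_{j=1}^m \|P_L\phi_j\|^2 = r$, $\lambda \ge \lambda_m$. Denote by $\Psi = \Psi_{S,W}$ the set of all nondecreasing
functions $\varphi : \mathbb{R}_+ \to \mathbb{R}_+$ such that $\lambda \mapsto \frac{\varphi(\lambda)}{\bar F(\lambda)}$ is nonincreasing and $\varphi(S; \lambda) \le \varphi(\lambda)$, $\lambda \ge 0$.
It is easy to see that the class of functions $\Psi_{S,W}$ contains the smallest function (uniformly
in $\lambda \ge 0$) that will be denoted by $\bar\varphi(S; \lambda)$ and it is given by the following expression:

$$\bar\varphi(S; \lambda) := \sup_{\sigma \le \lambda}\bar F(\sigma)\sup_{\sigma' \ge \sigma}\frac{\varphi(S; \sigma')}{\bar F(\sigma')}.$$

It easily follows from this definition that $\bar\varphi(S; \lambda) = r$, $\lambda \ge \lambda_m$. Note that since the
function $\frac{\bar\varphi(S; \lambda)}{\bar F(\lambda)}$ is nonincreasing and it is equal to $\frac{r}{m}$ for $\lambda \ge \lambda_m$, we have

$$\bar\varphi(S; \lambda) \ge \frac{r}{m}\bar F(\lambda) \ge \frac{r}{m}F(\lambda), \quad \lambda \ge 0. \tag{4.3}$$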

Given t > and A € (0, X ko ], let i n>TO := i + 3 log ^2 log 2 n + i log 2 ^ + 2^. Suppose 
that, for some constant $D > 0$,

$$\varepsilon \ge Da\left(\sqrt{\frac{\log(2m)}{nm}} \vee \frac{\log(2m)}{n}\right). \tag{4.4}$$



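Theorem 7 There exist constants $C, D$ depending only on $c, \gamma$ such that, for all $\bar\varepsilon \in
[0, \lambda_{k_0}^{-1}]$, with probability at least $1 - e^{-t}$,

$$\|\hat S_{\varepsilon, \bar\varepsilon} - S^*\|^2_{L_2(\Pi^2)} \le \inf_{S \in \mathbb{D}}\left[\|S - S^*\|^2_{L_2(\Pi^2)} + Cm^2\varepsilon^2\bar\varphi(S; \bar\varepsilon^{-1}) + \bar\varepsilon\|W^{1/2}S\|^2_{L_2(\Pi^2)}\right] + C\frac{a^2 t_{n,m}}{n}. \tag{4.5}$$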

Remarks. 1. Under the additional assumption that $m\log(2m) \le n$, one can take
$\varepsilon = Da\sqrt{\frac{\log(2m)}{nm}}$. In this case, the main part of the random error term in the right hand
side of bound (4.5) becomes

$$Cm^2\varepsilon^2\bar\varphi(S; \bar\varepsilon^{-1}) = CD^2\frac{a^2 m\,\bar\varphi(S; \bar\varepsilon^{-1})\log(2m)}{n}.$$

2. Note also that Theorem 7 holds in the case when $\bar\varepsilon = 0$. In this case, our method
coincides with nuclear norm penalized least squares (matrix LASSO) and $\bar\varphi(S; \bar\varepsilon^{-1}) =
\mathrm{rank}(S)$, so the bound of Theorem 7 becomes

$$\|\hat S_{\varepsilon, 0} - S^*\|^2_{L_2(\Pi^2)} \le \inf_{S \in \mathbb{D}}\left[\|S - S^*\|^2_{L_2(\Pi^2)} + Cm^2\varepsilon^2\,\mathrm{rank}(S)\right] + C\frac{a^2 t_{n,m}}{n}. \tag{4.6}$$



Similar oracle inequalities were proved by Koltchinskii, Lounici and Tsybakov (2011)
(see Theorem 2 in the Introduction) for a modified least squares method with nuclear
norm penalty (1.4). Clearly, (4.6) implies that

$$\|\hat S_{\varepsilon, 0} - S^*\|^2_{L_2(\Pi^2)} \le Cm^2\varepsilon^2\,\mathrm{rank}(S^*) + C\frac{a^2 t_{n,m}}{n}, \tag{4.7}$$

which is a version of a bound proved very recently by Klopp (2012) for the matrix LASSO
with constrained $L_\infty$-norm.

3. Suppose that 

I 



S{l;a) := Is : S G S V ,S = ^ . 



and consider the following estimator 



■Sj : = ar g min SG5(Z;a) 



1 " 



Let now W be the orthogonal projection onto l.s.{<fo+i, . . . , </> TO } (for an orthonormal 
system {i^i, . . . , (j>m}.) Since, for all 5 G S(l;a), || Vl^ 1//2 S'|| j ^ 2 (n 2 ) = 0, the estimator Si 
coincides with S B g for an arbitrary e > 0. It is easy to check that, for this choice of W, one 
can take F(\) := (ra\/AVl)Am, and we have y?(S; A) = Y^j=i W^L^j || 2 V ^T^> where r = 
rank(S). We will choose e := m 2 and A = m~ 2 . Then, e" 1 ) = £j=i ||PL</>j|| 2 V ;£> 
and the bound of Theorem [7] becomes 



\ s i ~ s *Wl 2 {u 2 ) - 



inf 

S€S(l;a) 



I C _ C ||2 



+ Cm 2 e 2 ( ^ H-Psupp(S)^ 
•j'=i 



2 v / rank(S')/ \ 
* m / 



+ C 



a 2 t 



a 



In fact, an inspection of the proof shows that the term 



rank(g)j 



is not needed in this 



special case. 

Using simple aggregation techniques, it is easy to construct an adaptive estimator for
which the oracle inequality of Theorem 7 holds with the optimal value of $\bar\varepsilon$ that minimizes
the right hand side of the bound. To this end, divide the sample $(X_1, X_1', Y_1), \dots, (X_n, X_n', Y_n)$
into two parts,

$$(X_j, X_j', Y_j),\ j = 1, \dots, n' \quad \text{and} \quad (X_{n'+j}, X_{n'+j}', Y_{n'+j}),\ j = 1, \dots, n - n',$$

where $n' := [n/2] + 1$. The first part of the sample will be used to compute the estimators
$\hat S_l := \hat S_{\varepsilon, \bar\varepsilon_l}$, $\bar\varepsilon_l := \lambda_l^{-1}$, $l = k_0, \dots, m+1$ (they are defined by (4.1), but they are based only
on the first $n'$ observations). The second part of the sample is used for model selection:

$$\hat l := \operatorname*{argmin}_{l = k_0, \dots, m+1}\frac{1}{n - n'}\sum_{j=1}^{n - n'}\big(Y_{n'+j} - \hat S_l(X_{n'+j}, X_{n'+j}')\big)^2.$$

Finally, let $\hat S := \hat S_{\hat l}$.
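The sample-splitting scheme just described can be sketched in a few lines. The code below is a generic illustration, not the estimator of the paper: `select_by_holdout` and its arguments are hypothetical names, each candidate is an arbitrary fitting procedure that maps the first part of the sample to a predictor of $Y$ from $(X, X')$, and the selection step minimizes the mean squared error on the second part.

```python
def select_by_holdout(sample, candidates):
    """Fit every candidate on the first part, select on the second part.

    `sample` is a list of triples (x, x_prime, y); `candidates` is a list
    of fitting procedures, each mapping a training list to a function
    (x, x_prime) -> prediction. Requires at least three observations.
    """
    n = len(sample)
    n_first = n // 2 + 1  # n' := [n/2] + 1
    first, second = sample[:n_first], sample[n_first:]
    fitted = [fit(first) for fit in candidates]

    def holdout_risk(predictor):
        # Mean squared prediction error on the held-out observations.
        return sum((y - predictor(x, xp)) ** 2 for x, xp, y in second) / len(second)

    best = min(range(len(fitted)), key=lambda i: holdout_risk(fitted[i]))
    return best, fitted[best]


# Toy example: constant responses, two constant candidate predictors.
data = [(0, 1, 1.0)] * 6
procedures = [lambda train: (lambda u, v: 0.0), lambda train: (lambda u, v: 1.0)]
index, estimator = select_by_holdout(data, procedures)
print(index)
```

Because the candidates are computed from the first part only, they can be treated as fixed functions when the second part is used, which is what makes the conditional argument in the proof below possible.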

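Theorem 8 Under the assumptions and notations of Theorem 7, with probability at
least $1 - e^{-t}$,

$$\|\hat S - S^*\|^2_{L_2(\Pi^2)} \le \inf_{S \in \mathbb{D}}\left[2\|S - S^*\|^2_{L_2(\Pi^2)} + C\inf_{\bar\varepsilon \in [0, \lambda_{k_0}^{-1}]}\left(m^2\varepsilon^2\bar\varphi(S; \bar\varepsilon^{-1}) + \bar\varepsilon\|W^{1/2}S\|^2_{L_2(\Pi^2)}\right)\right] + C\frac{a^2(\log(m+1) + t_{n,m})}{n}. \tag{4.8}$$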

Proof. The idea of aggregation result behind this theorem is rather well known (see
Massart (2007), Chapter 8). The proof can be deduced, for instance, from Proposition
2 used in Section 3. Specifically, this proposition has to be applied in the case when $\mathcal{G}$
is a finite class of functions bounded by 1. Let $N := \mathrm{card}(\mathcal{G})$. Then, for some numerical
constant $C_1 > 0$,

$$\psi_n(\delta) \le C_1\left[\sqrt{\frac{\delta\log N}{n}} \vee \frac{\log N}{n}\right]$$

(see, e.g., Koltchinskii (2011b), Theorem 3.5) and Proposition 2 easily implies that, for
all $t > 0$, with probability at least $1 - e^{-t}$

$$\|\hat g - g^*\|^2_{L_2(\Pi)} \le 2\inf_{g \in \mathcal{G}}\|g - g^*\|^2_{L_2(\Pi)} + C_2\frac{\log N + t}{n}, \tag{4.9}$$

where $C_2 > 0$ is a constant. We will assume that $a = 1$ (in the general case, the result
would follow by rescaling) and use bound (4.9), conditionally on the first part of the
sample, in the case when $\mathcal{G} := \{\hat S_l : l = k_0, \dots, m+1\}$. Then, given $(X_j, X_j', Y_j)$, $j =
1, \dots, n'$, with probability at least $1 - e^{-t}$

$$\|\hat S - S^*\|^2_{L_2(\Pi^2)} \le 2\min_{k_0 \le l \le m+1}\|\hat S_l - S^*\|^2_{L_2(\Pi^2)} + C_2\frac{\log(m+1) + t}{n}. \tag{4.10}$$




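By Theorem 7 (with $t$ replaced by $t + \log(m+1)$) and the union bound, we get that,
with probability at least $1 - e^{-t}$, for all $l = k_0, \dots, m+1$,

$$\|\hat S_l - S^*\|^2_{L_2(\Pi^2)} \le \inf_{S \in \mathbb{D}}\left[\|S - S^*\|^2_{L_2(\Pi^2)} + C_3 m^2\varepsilon^2\bar\varphi(S; \bar\varepsilon_l^{-1}) + \bar\varepsilon_l\|W^{1/2}S\|^2_{L_2(\Pi^2)}\right] + C_3\frac{\log(m+1) + t_{n,m}}{n}$$

with some constant $C_3 > 0$. This yields the following upper bound on the minimal error
$\min_{k_0 \le l \le m+1}\|\hat S_l - S^*\|^2_{L_2(\Pi^2)}$:

$$\min_{k_0 \le l \le m+1}\|\hat S_l - S^*\|^2_{L_2(\Pi^2)} \le \inf_{S \in \mathbb{D}}\left[\|S - S^*\|^2_{L_2(\Pi^2)} + C_3\min_{k_0 \le l \le m+1}\left(m^2\varepsilon^2\bar\varphi(S; \bar\varepsilon_l^{-1}) + \bar\varepsilon_l\|W^{1/2}S\|^2_{L_2(\Pi^2)}\right)\right] + C_3\frac{\log(m+1) + t_{n,m}}{n}. \tag{4.11}$$

Using monotonicity of the function $\lambda \mapsto \bar\varphi(S; \lambda)$ and the condition that $\lambda_{l+1} \le c\lambda_l$, $l =
k_0, \dots, m-1$, it is easy to replace the minimum over $l$ in (4.11) by the infimum over $\bar\varepsilon$.
Combining (4.11) and (4.10) and adjusting the constants yields the result.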

□ 

Using more sophisticated aggregation methods (for instance, such as the methods
studied by Gaïffas and Lecué (2011)), it is possible to construct an estimator $\hat S$ for which
the oracle inequality similar to (4.8) holds with constant 1 in front of the approximation
error term $\|S - S^*\|^2_{L_2(\Pi^2)}$.

To understand better the meaning of function $\bar\varphi$ involved in the statements of Theorems
7 and 8, it makes sense to relate it to the low coherence assumptions discussed in
the Introduction. Indeed, suppose that, for some $\nu \ge 1$,

$$\|P_L\phi_k\|^2 \le \frac{\nu r}{m}, \quad k = 1, \dots, m. \tag{4.12}$$

This is a part of standard low coherence assumptions on matrix $S^*$ with respect to the
orthonormal basis $\{\phi_k\}$ (see (1.2)). Clearly, it implies that¹

$$\bar\varphi(S; \lambda) \le \frac{\nu r\bar F(\lambda)}{m}, \quad \lambda \ge 0. \tag{4.13}$$

Suppose that $n \ge m\log(2m)$ and $\varepsilon = Da\sqrt{\frac{\log(2m)}{nm}}$. If condition (4.13) holds for the
target kernel $S^*$ with $r = \mathrm{rank}(S^*)$ and some $\nu \ge 1$, then Theorem 7 implies that with


¹ Compare (4.13) with (4.3).
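probability at least $1 - e^{-t}$,

$$\|\hat S_{\varepsilon, \bar\varepsilon} - S^*\|^2_{L_2(\Pi^2)} \le C\frac{a^2\nu r\bar F(\bar\varepsilon^{-1})\log(2m)}{n} + \bar\varepsilon\|W^{1/2}S^*\|^2_{L_2(\Pi^2)} + C\frac{a^2 t_{n,m}}{n}$$

and Theorem 8 implies that with the same probability

$$\|\hat S - S^*\|^2_{L_2(\Pi^2)} \le C\inf_{\bar\varepsilon \in [0, \lambda_{k_0}^{-1}]}\left(\frac{a^2\nu r\bar F(\bar\varepsilon^{-1})\log(2m)}{n} + \bar\varepsilon\|W^{1/2}S^*\|^2_{L_2(\Pi^2)}\right) + C\frac{a^2(\log(m+1) + t_{n,m})}{n}.$$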

Example. If $\lambda_k \asymp k^{2\beta}$ for some $\beta > 1/2$, then it is easy to check that $F(\lambda) \asymp \lambda^{1/(2\beta)}$.
Under the assumption that $\|W^{1/2}S^*\|^2_{L_2(\Pi^2)} \le \rho^2$, we get the bound



V 



a 2 (log(m + l) +t r 



n 

Under the following slightly modified version of low coherence assumption (4.13),

.^B>£M A >„, (4.15, 
m 

one can almost recover upper bounds of Section 3:

va 2 p / va 2 p 

n 



a 2 rm j y q 2 (log(m + 1) + £ n ,m) ^ 



The main difference with what was proved in Section 3 is that now the low coherence
constant $\nu$ is involved in the bounds, so, the methods discussed in this section yield correct
(up to log factors) error rates provided that the target kernel $S^*$ has "low coherence"
with respect to the basis of eigenfunctions of $W$.

Proof of Theorem 7. Bound (4.5) will be proved for a fixed oracle $S \in \mathbb{D}$ and an
arbitrary function $\varphi \in \Psi_{S,W}$ with $\varphi(\lambda) = r$, $\lambda \ge \lambda_m$ instead of $\bar\varphi$. It then can be applied
to the function $\bar\varphi$ (which is the smallest function in $\Psi_{S,W}$). Without loss of generality,
we assume that $a = 1$; the general case then follows by a simple rescaling. Finally, we
will denote $\hat S := \hat S_{\varepsilon, \bar\varepsilon}$ throughout the proof.

Define the following orthogonal projectors $\mathcal{P}_L, \mathcal{P}_L^\perp$ in the space $\mathcal{S}_V$ with Hilbert-Schmidt
inner product: $\mathcal{P}_L(A) := A - P_{L^\perp}AP_{L^\perp}$, $\mathcal{P}_L^\perp(A) := P_{L^\perp}AP_{L^\perp}$, $A \in \mathcal{S}_V$. We will
use a well known representation of subdifferential of convex function $S \mapsto \|S\|_1$:

$$\partial\|S\|_1 = \left\{\mathrm{sign}(S) + \mathcal{P}_L^\perp(M) : M \in \mathcal{S}_V,\ \|M\| \le 1\right\},$$

where $L = \mathrm{supp}(S)$ (see Koltchinskii (2011b), Appendix A.4 and references therein).
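Denote

$$L_n(S) := \frac{1}{n}\sum_{j=1}^n (Y_j - S(X_j, X_j'))^2 + \varepsilon\|S\|_1 + \bar\varepsilon\|W^{1/2}S\|^2_{L_2(\Pi^2)},$$

so that $\hat S := \operatorname*{argmin}_{S \in \mathbb{D}}L_n(S)$. An arbitrary matrix $A \in \partial L_n(\hat S)$ can be represented as

$$A = \frac{2}{n}\sum_{i=1}^n \hat S(X_i, X_i')E_{X_i, X_i'} - \frac{2}{n}\sum_{i=1}^n Y_iE_{X_i, X_i'} + \varepsilon\hat V + 2\frac{\bar\varepsilon}{m^2}W\hat S, \tag{4.16}$$

where $\hat V \in \partial\|\hat S\|_1$. Since $\hat S$ is a minimizer of $L_n(S)$, there exists a matrix $A \in \partial L_n(\hat S)$
such that $-A$ belongs to the normal cone of $\mathbb{D}$ at the point $\hat S$ (see Aubin and Ekeland
(1984), Chap. 2, Corollary 6). This implies that $\langle A, \hat S - S\rangle \le 0$ and, in view of (4.16),

$$2P_n(\hat S(\hat S - S)) - \Big\langle\frac{2}{n}\sum_{i=1}^n Y_iE_{X_i, X_i'}, \hat S - S\Big\rangle + \varepsilon\langle\hat V, \hat S - S\rangle + 2\frac{\bar\varepsilon}{m^2}\langle W\hat S, \hat S - S\rangle \le 0. \tag{4.17}$$

Here and in what follows $P_n$ denotes the empirical distribution based on the sample
$(X_1, X_1', Y_1), \dots, (X_n, X_n', Y_n)$. The corresponding true distribution of $(X, X', Y)$ will be
denoted by $P$. It easily follows from (4.17) that

$$2\langle\hat S - S^*, \hat S - S\rangle_{L_2(P_n)} - 2\langle\Xi, \hat S - S\rangle + \varepsilon\langle\hat V, \hat S - S\rangle + 2\bar\varepsilon\langle W^{1/2}\hat S, W^{1/2}(\hat S - S)\rangle_{L_2(\Pi^2)} \le 0,$$

where

$$\Xi := \frac{1}{n}\sum_{j=1}^n (Y_j - S^*(X_j, X_j'))E_{X_j, X_j'}.$$

We can now rewrite the last bound as

$$2\langle\hat S - S^*, \hat S - S\rangle_{L_2(P)} + \varepsilon\langle\hat V, \hat S - S\rangle + 2\bar\varepsilon\langle W^{1/2}(\hat S - S), W^{1/2}(\hat S - S)\rangle_{L_2(\Pi^2)}$$
$$\le -2\bar\varepsilon\langle W^{1/2}S, W^{1/2}(\hat S - S)\rangle_{L_2(\Pi^2)} + 2\langle\Xi, \hat S - S\rangle + 2(P - P_n)((\hat S - S^*)(\hat S - S))$$

and use a simple identity

$$2\langle\hat S - S^*, \hat S - S\rangle_{L_2(P)} = 2\langle\hat S - S^*, \hat S - S\rangle_{L_2(\Pi^2)} = \|\hat S - S^*\|^2_{L_2(\Pi^2)} + \|\hat S - S\|^2_{L_2(\Pi^2)} - \|S - S^*\|^2_{L_2(\Pi^2)}$$

to get the following bound:

$$\|\hat S - S^*\|^2_{L_2(\Pi^2)} + \|\hat S - S\|^2_{L_2(\Pi^2)} + 2\bar\varepsilon\|W^{1/2}(\hat S - S)\|^2_{L_2(\Pi^2)} + \varepsilon\langle\hat V, \hat S - S\rangle \tag{4.18}$$
$$\le \|S - S^*\|^2_{L_2(\Pi^2)} - 2\bar\varepsilon\langle W^{1/2}S, W^{1/2}(\hat S - S)\rangle_{L_2(\Pi^2)} + 2\langle\Xi, \hat S - S\rangle + 2(P - P_n)(S - S^*)(\hat S - S) + 2(P - P_n)(\hat S - S)^2.$$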
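The identity used to pass to (4.18) is the usual expansion of a squared norm. Writing $u := \hat S - S^*$ and $v := \hat S - S$ for the estimator $\hat S$ (so that $u - v = S - S^*$), one has in any inner product space

$$\|u - v\|^2 = \|u\|^2 + \|v\|^2 - 2\langle u, v\rangle \quad \Longrightarrow \quad 2\langle u, v\rangle = \|u\|^2 + \|v\|^2 - \|u - v\|^2,$$

which, applied in $L_2(\Pi^2)$, gives $2\langle\hat S - S^*, \hat S - S\rangle = \|\hat S - S^*\|^2 + \|\hat S - S\|^2 - \|S - S^*\|^2$.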
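For an arbitrary $V \in \partial\|S\|_1$, $V = \mathrm{sign}(S) + \mathcal{P}_L^\perp(M)$, where $M$ is a matrix with
$\|M\| \le 1$. It follows from the trace duality property that there exists an $M$ with $\|M\| \le 1$
such that

$$\langle\mathcal{P}_L^\perp(M), \hat S - S\rangle = \langle M, \mathcal{P}_L^\perp(\hat S - S)\rangle = \langle M, \mathcal{P}_L^\perp(\hat S)\rangle = \|\mathcal{P}_L^\perp(\hat S)\|_1,$$

where the first equality is based on the fact that $\mathcal{P}_L^\perp$ is a self-adjoint operator and the
second equality is based on the fact that $S$ has support $L$. Using this equation and
monotonicity of subdifferentials of convex functions, we get

$$\langle\mathrm{sign}(S), \hat S - S\rangle + \|\mathcal{P}_L^\perp(\hat S)\|_1 = \langle V, \hat S - S\rangle \le \langle\hat V, \hat S - S\rangle.$$

Substituting this into the left hand side of (4.18), it is easy to get

$$\|\hat S - S^*\|^2_{L_2(\Pi^2)} + \|\hat S - S\|^2_{L_2(\Pi^2)} + \varepsilon\|\mathcal{P}_L^\perp(\hat S)\|_1 + 2\bar\varepsilon\|W^{1/2}(\hat S - S)\|^2_{L_2(\Pi^2)} \tag{4.19}$$
$$\le \|S - S^*\|^2_{L_2(\Pi^2)} - \varepsilon\langle\mathrm{sign}(S), \hat S - S\rangle - 2\bar\varepsilon\langle W^{1/2}S, W^{1/2}(\hat S - S)\rangle_{L_2(\Pi^2)}$$
$$+ 2\langle\Xi, \hat S - S\rangle + 2(P - P_n)(S - S^*)(\hat S - S) + 2(P - P_n)(\hat S - S)^2.$$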
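We need to bound the right hand side of (4.19). We start with deriving a bound on
$\langle\mathrm{sign}(S), \hat S - S\rangle$, expressed in terms of function $\varphi$. Note that, for all $\lambda > 0$,

$$\langle\mathrm{sign}(S), \hat S - S\rangle = \sum_{k=1}^m \langle\mathrm{sign}(S)\phi_k, (\hat S - S)\phi_k\rangle = \sum_{\lambda_k \le \lambda}\langle\mathrm{sign}(S)\phi_k, (\hat S - S)\phi_k\rangle + \sum_{\lambda_k > \lambda}\Big\langle\frac{\mathrm{sign}(S)\phi_k}{\sqrt{\lambda_k}}, \sqrt{\lambda_k}(\hat S - S)\phi_k\Big\rangle,$$

which easily implies

$$|\langle\mathrm{sign}(S), \hat S - S\rangle| \le \Big(\sum_{\lambda_k \le \lambda}\|\mathrm{sign}(S)\phi_k\|^2\Big)^{1/2}\Big(\sum_{\lambda_k \le \lambda}\|(\hat S - S)\phi_k\|^2\Big)^{1/2} + \Big(\sum_{\lambda_k > \lambda}\frac{\|\mathrm{sign}(S)\phi_k\|^2}{\lambda_k}\Big)^{1/2}\Big(\sum_{\lambda_k > \lambda}\lambda_k\|(\hat S - S)\phi_k\|^2\Big)^{1/2} \tag{4.20}$$
$$\le \Big(\sum_{\lambda_k \le \lambda}\|P_L\phi_k\|^2\Big)^{1/2}\|\hat S - S\|_2 + \Big(\sum_{\lambda_k > \lambda}\frac{\|P_L\phi_k\|^2}{\lambda_k}\Big)^{1/2}\|W^{1/2}(\hat S - S)\|_2.$$

We will now use the following elementary lemma.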
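Lemma 4 For all $\lambda > 0$,

$$\sum_{\lambda_k > \lambda}\frac{\|P_L\phi_k\|^2}{\lambda_k} \le c_\gamma\frac{\varphi(\lambda)}{\lambda} \quad \text{and} \quad \sum_{\lambda_k > \lambda}\frac{1}{\lambda_k} \le c_\gamma\frac{\bar F(\lambda)}{\lambda},$$

where $c_\gamma := \frac{c + \gamma}{\gamma}$.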



Proof. Denote := Ylj=i II-Pl'AjII 2 ) ^ = lr--i m ' Suppose that A G [A;,Aj + x] 
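for some $l = k_0 - 1, \dots, m - 1$. We will use the properties of functions $\varphi \in \Psi_{S,W}$ and
$\bar F$. In particular, recall that the functions $\frac{\varphi(\lambda)}{\bar F(\lambda)}$ and $\frac{\bar F(\lambda)}{\lambda}$ are nonincreasing. Using these
properties and the condition that $\lambda_{k+1} \le c\lambda_k$, $k \ge k_0$ we get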



V l|PL0 fc || 2 _ V IV 



Am Aj+i 



fc=j+l ^ fc Afc +!/ A ™ fc =J+i A fc+1 A ™ 

a s A 7 A F(s) s J A 

^(A) f°°F(s) dc | y(A) < c y(A)F(A) | <p(A) _ c + 7 y(A) 



F(A) As 2 A - 7 F(A) A A 7 A 

which proves the first bound. To prove the second bound, replace in the inequalities
above $\|P_L\phi_k\|^2$ by 1 and $\varphi(\lambda)$ by $\bar F(\lambda)$. In the case when $\lambda \ge \lambda_m$, both bounds are
trivial since their left hand sides are equal to zero.

□ 

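It follows from (4.20) and the first bound of Lemma 4 that

$$|\langle\mathrm{sign}(S), \hat S - S\rangle| \le \sqrt{\varphi(\lambda)}\,\|\hat S - S\|_2 + \sqrt{c_\gamma\frac{\varphi(\lambda)}{\lambda}}\,\|W^{1/2}(\hat S - S)\|_2 = m\sqrt{\varphi(\lambda)}\,\|\hat S - S\|_{L_2(\Pi^2)} + m\sqrt{c_\gamma\frac{\varphi(\lambda)}{\lambda}}\,\|W^{1/2}(\hat S - S)\|_{L_2(\Pi^2)}. \tag{4.21}$$

This implies the following bound:

$$\varepsilon|\langle\mathrm{sign}(S), \hat S - S\rangle| \le \varphi(\lambda)m^2\varepsilon^2 + \frac{1}{4}\|\hat S - S\|^2_{L_2(\Pi^2)} + c_\gamma\frac{\varphi(\lambda)m^2\varepsilon^2}{\lambda\bar\varepsilon} + \frac{\bar\varepsilon}{4}\|W^{1/2}(\hat S - S)\|^2_{L_2(\Pi^2)}, \tag{4.22}$$

where we used twice an elementary inequality $ab \le a^2 + \frac{1}{4}b^2$, $a, b \ge 0$. We will apply this
bound for $\lambda = \bar\varepsilon^{-1}$ to get the following inequality:

$$\varepsilon|\langle\mathrm{sign}(S), \hat S - S\rangle| \le (c_\gamma + 1)m^2\varepsilon^2\varphi(\bar\varepsilon^{-1}) + \frac{1}{4}\|\hat S - S\|^2_{L_2(\Pi^2)} + \frac{\bar\varepsilon}{4}\|W^{1/2}(\hat S - S)\|^2_{L_2(\Pi^2)}. \tag{4.23}$$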
To bound the second term in the right hand side of (4.19), note that

eKW^S, W l ' 2 (S - 5)) i2(n2) | < eWW^Sh^WW^iS - 5)|| i2(n2) <(4.24) 
^1|W 1/2 5||| 2(n2 ) + |ll^ 1/2 (5-5)||| 2(n2) . 

The main part of the proof deals with bounding the stochastic term

$$2\langle\Xi, \hat S - S\rangle + 2(P - P_n)(S - S^*)(\hat S - S) + 2(P - P_n)(\hat S - S)^2$$

in the right hand side of (4.19). To this end, define (for fixed $S, S^*$)

$$f_A(y, u, v) := (y - S^*(u, v))(A - S)(u, v) - (S - S^*)(u, v)(A - S)(u, v) - (A - S)^2(u, v) = (y - S(u, v))(A - S)(u, v) - (A - S)^2(u, v)$$

and consider the following empirical process

$$\alpha_n(\delta_1, \delta_2, \delta_3) := \sup\{|(P_n - P)(f_A)| : A \in T(\delta_1, \delta_2, \delta_3)\},$$

where

$$T(\delta_1, \delta_2, \delta_3) := \left\{A \in \mathbb{D} : \|A - S\|_{L_2(\Pi^2)} \le \delta_1,\ \|\mathcal{P}_L^\perp A\|_1 \le \delta_2,\ \|W^{1/2}(A - S)\|_{L_2(\Pi^2)} \le \delta_3\right\}.$$

Clearly, we have

$$2\langle\Xi, \hat S - S\rangle + 2(P - P_n)(S - S^*)(\hat S - S) + 2(P - P_n)(\hat S - S)^2 \le 2\alpha_n\left(\|\hat S - S\|_{L_2(\Pi^2)}, \|\mathcal{P}_L^\perp\hat S\|_1, \|W^{1/2}(\hat S - S)\|_{L_2(\Pi^2)}\right) \tag{4.25}$$

and it remains to provide an upper bound on $\alpha_n(\delta_1, \delta_2, \delta_3)$ that is uniform in some intervals
of the parameters $\delta_1, \delta_2, \delta_3$ (such that either the norms $\|\hat S - S\|_{L_2(\Pi^2)}$, $\|\mathcal{P}_L^\perp\hat S\|_1$, $\|W^{1/2}(\hat S -
S)\|_{L_2(\Pi^2)}$ belong to these intervals with a high probability, or bound of the theorem trivially
holds). Note that the functions $f_A$ are uniformly bounded by a numerical constant
(under the assumptions that $a = 1$, $|Y| \le a$ and all the kernels are also bounded by
$a$) and we have $Pf_A^2 \le c_1\|A - S\|^2_{L_2(\Pi^2)}$ with some numerical constant $c_1 > 0$. Using
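Talagrand's concentration inequality for empirical processes we conclude that for fixed
$\delta_1, \delta_2, \delta_3$ with probability at least $1 - e^{-t}$ and with some constant $c_2 > 0$

$$\alpha_n(\delta_1, \delta_2, \delta_3) \le 2\mathbb{E}\alpha_n(\delta_1, \delta_2, \delta_3) + c_2\left(\delta_1\sqrt{\frac{t}{n}} + \frac{t}{n}\right).$$

We will make this bound uniform in $\delta_k \in [\delta_k^-, \delta_k^+]$, $\delta_k^- < \delta_k^+$, $k = 1, 2, 3$ (these intervals
will be chosen later). Define $\delta_k^j := \delta_k^+2^{-j}$, $j = 0, \dots, [\log_2(\delta_k^+/\delta_k^-)] + 1$, $k = 1, 2, 3$ and
let $\bar t := t + \sum_{k=1}^3 \log\left([\log_2(\delta_k^+/\delta_k^-)] + 2\right)$. By the union bound, with probability at least
$1 - e^{-t}$ and for all $j_k = 0, \dots, [\log_2(\delta_k^+/\delta_k^-)] + 1$, $k = 1, 2, 3$,

$$\alpha_n(\delta_1^{j_1}, \delta_2^{j_2}, \delta_3^{j_3}) \le 2\mathbb{E}\alpha_n(\delta_1^{j_1}, \delta_2^{j_2}, \delta_3^{j_3}) + c_2\left(\delta_1^{j_1}\sqrt{\frac{\bar t}{n}} + \frac{\bar t}{n}\right).$$

By monotonicity of $\alpha_n$ and of the right hand side of the bound with respect to each of the
variables $\delta_1, \delta_2, \delta_3$, we conclude that with the same probability and with some numerical
constant $c_3 > 0$, for all $\delta_k \in [\delta_k^-, \delta_k^+]$, $k = 1, 2, 3$,

$$\alpha_n(\delta_1, \delta_2, \delta_3) \le 2\mathbb{E}\alpha_n(2\delta_1, 2\delta_2, 2\delta_3) + c_3\left(\delta_1\sqrt{\frac{\bar t}{n}} + \frac{\bar t}{n}\right). \tag{4.26}$$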
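The discretization behind (4.26) can be made concrete with a few lines of code. The sketch below is only an illustration with hypothetical function names: `dyadic_grid` builds the points $\delta^+ 2^{-j}$, $j = 0, \dots, [\log_2(\delta^+/\delta^-)] + 1$, for one parameter, and `inflated_level` adds to $t$ the logarithm of the number of grid points for each parameter, which is the price of the union bound.

```python
import math


def dyadic_grid(delta_minus, delta_plus):
    """Points delta_plus * 2^{-j}, j = 0, ..., floor(log2(delta_plus/delta_minus)) + 1."""
    last = math.floor(math.log2(delta_plus / delta_minus)) + 1
    return [delta_plus * 2.0 ** (-j) for j in range(last + 1)]


def inflated_level(t, intervals):
    """t plus the sum, over the parameter intervals, of log(number of grid points)."""
    return t + sum(math.log(len(dyadic_grid(lo, hi))) for lo, hi in intervals)


grid = dyadic_grid(0.125, 1.0)
print(grid)
print(inflated_level(1.0, [(0.125, 1.0)] * 3))
```

Every value in $[\delta^-, \delta^+]$ lies between two consecutive grid points that differ by a factor of 2, which explains the arguments $2\delta_k$ on the right hand side of (4.26).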



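To bound the expectation $\mathbb{E}\alpha_n(2\delta_1,2\delta_2,2\delta_3)$ in the right hand side of (4.26), note that, by the definition of function $f_A$,

$$\mathbb{E}\alpha_n(\delta_1,\delta_2,\delta_3) \le \mathbb{E}\sup\bigl\{|(P_n - P)(y - S)(A - S)| : A \in \mathcal{T}(\delta_1,\delta_2,\delta_3)\bigr\} + \mathbb{E}\sup\bigl\{|(P_n - P)(A - S)^2| : A \in \mathcal{T}(\delta_1,\delta_2,\delta_3)\bigr\}. \tag{4.27}$$

A standard application of the symmetrization inequality followed by the contraction inequality for Rademacher sums (see, e.g., Koltchinskii (2011b), Chapter 2) yields

$$\mathbb{E}\sup\bigl\{|(P_n - P)(A - S)^2| : A \in \mathcal{T}(\delta_1,\delta_2,\delta_3)\bigr\} \le 16\,\mathbb{E}\sup\bigl\{|R_n(A - S)| : A \in \mathcal{T}(\delta_1,\delta_2,\delta_3)\bigr\}. \tag{4.28}$$

It easily follows from (4.27) and (4.28) that

$$\mathbb{E}\alpha_n(\delta_1,\delta_2,\delta_3) \le \mathbb{E}\sup\bigl\{|\langle\Xi_1, A - S\rangle| : A \in \mathcal{T}(\delta_1,\delta_2,\delta_3)\bigr\} + 16\,\mathbb{E}\sup\bigl\{|\langle\Xi_2, A - S\rangle| : A \in \mathcal{T}(\delta_1,\delta_2,\delta_3)\bigr\}, \tag{4.29}$$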



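where

$$\Xi_1 := \frac{1}{n}\sum_{j=1}^n\bigl(Y_j - S(X_j,X_j')\bigr)E_{X_j,X_j'} - \mathbb{E}\bigl(Y - S(X,X')\bigr)E_{X,X'}$$

and $\Xi_2 := \frac{1}{n}\sum_{j=1}^n\varepsilon_jE_{X_j,X_j'}$, $\{\varepsilon_j\}$ being i.i.d. Rademacher random variables independent of $(X_1,X_1',Y_1),\dots,(X_n,X_n',Y_n)$. We will upper bound the expectations in the right hand side of (4.29), which reduces to bounding $\mathbb{E}\sup\{|\langle\Xi_i, A - S\rangle| : A \in \mathcal{T}(\delta_1,\delta_2,\delta_3)\}$ for each of the random matrices $\Xi_1,\Xi_2$. For $i = 1,2$ and $A \in \mathcal{T}(\delta_1,\delta_2,\delta_3)$, we have

$$|\langle\Xi_i, A - S\rangle| \le |\langle\Xi_i, \mathcal{P}_L(A - S)\rangle| + |\langle\Xi_i, \mathcal{P}_L^{\perp}(A)\rangle| \le |\langle\mathcal{P}_L\Xi_i, A - S\rangle| + \|\Xi_i\|\,\|\mathcal{P}_L^{\perp}(A)\|_1 \le |\langle\mathcal{P}_L\Xi_i, A - S\rangle| + \delta_2\|\Xi_i\|. \tag{4.30}$$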



To bound $\|\Xi_i\|$, we use a version of the noncommutative Bernstein inequality of Ahlswede and Winter (2002) (see also Tropp (2010), Koltchinskii (2011a, 2011b, 2011c) for other versions of such inequalities).

Lemma 5 Let $Z$ be a random symmetric matrix with $\mathbb{E}Z = 0$, $\sigma_Z^2 := \|\mathbb{E}Z^2\|$ and $\|Z\| \le U$ for some $U > 0$. Let $Z_1,\dots,Z_n$ be $n$ i.i.d. copies of $Z$. Then for all $t > 0$, with probability at least $1 - e^{-t}$,

$$\Bigl\|\frac{1}{n}\sum_{i=1}^nZ_i\Bigr\| \le 2\Bigl(\sigma_Z\sqrt{\frac{t + \log(2m)}{n}} \vee U\frac{t + \log(2m)}{n}\Bigr).$$

This also implies that

$$\mathbb{E}\Bigl\|\frac{1}{n}\sum_{i=1}^nZ_i\Bigr\| \le 4\Bigl[\sigma_Z\sqrt{\frac{\log(2m)}{n}} \vee U\frac{\log(2m)}{n}\Bigr].$$

It is applied to i.i.d. random matrices

$$Z_j := \bigl(Y_j - S(X_j,X_j')\bigr)E_{X_j,X_j'} - \mathbb{E}\bigl(Y - S(X,X')\bigr)E_{X,X'}$$

in the case of matrix $\Xi_1$ and to i.i.d. random matrices $Z_j := \varepsilon_jE_{X_j,X_j'}$ in the case of matrix $\Xi_2$. In both cases, $\|Z_j\| \le 4$ and, by a simple computation, $\sigma_{Z_j}^2 := \|\mathbb{E}Z_j^2\| \le 4/m$ (see, e.g., Koltchinskii (2011b), Section 9.4). Lemma 5 implies that, for $i = 1,2$,

$$\mathbb{E}\|\Xi_i\| \le 16\Bigl[\sqrt{\frac{\log(2m)}{nm}} \vee \frac{\log(2m)}{n}\Bigr] =: \varepsilon^*. \tag{4.31}$$
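The constant 16 in (4.31) can be traced as follows. This is a sketch that assumes the expectation form of Lemma 5 carries the factor 4, together with the stated values $U = 4$ and $\sigma_Z \le 2/\sqrt{m}$.

```latex
\begin{align*}
\mathbb{E}\|\Xi_i\|
  &\le 4\Bigl[\sigma_Z\sqrt{\tfrac{\log(2m)}{n}} \vee U\,\tfrac{\log(2m)}{n}\Bigr] \\
  &\le 4\Bigl[\tfrac{2}{\sqrt{m}}\sqrt{\tfrac{\log(2m)}{n}} \vee 4\,\tfrac{\log(2m)}{n}\Bigr] \\
  &\le 16\Bigl[\sqrt{\tfrac{\log(2m)}{nm}} \vee \tfrac{\log(2m)}{n}\Bigr].
\end{align*}
% The first term in the maximum dominates exactly when m\log(2m) \le n.
```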



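To control the term $|\langle\mathcal{P}_L\Xi_i, A - S\rangle|$ in bound (4.30), we will use the following lemma.

Lemma 6 For all $\delta > 0$,

$$\mathbb{E}\sup_{\|M\|_2 \le \delta,\ \|W^{1/2}M\|_2 \le 1}|\langle\mathcal{P}_L\Xi_i, M\rangle| \le 4\sqrt{2}\sqrt{c_\gamma + 1}\,\delta\sqrt{\frac{\varphi(\delta^{-2})}{nm}}.$$

Proof. For all symmetric $m \times m$ matrices $M$,

$$\langle\mathcal{P}_L\Xi_i, M\rangle = \sum_{k,j=1}^m\langle\mathcal{P}_L\Xi_i, \phi_k\otimes\phi_j\rangle\langle M, \phi_k\otimes\phi_j\rangle.$$

Assuming that

$$\|M\|_2^2 = \sum_{k,j=1}^m|\langle M, \phi_k\otimes\phi_j\rangle|^2 \le \delta^2 \quad\text{and}\quad \|W^{1/2}M\|_2^2 = \sum_{k,j=1}^m\lambda_k|\langle M, \phi_k\otimes\phi_j\rangle|^2 \le 1,$$

it is easy to conclude that $\sum_{k,j=1}^m\frac{|\langle M, \phi_k\otimes\phi_j\rangle|^2}{\lambda_k^{-1}\wedge\delta^2} \le 2$. It follows that

$$|\langle\mathcal{P}_L\Xi_i, M\rangle| \le \Bigl(\sum_{k,j=1}^m(\lambda_k^{-1}\wedge\delta^2)|\langle\mathcal{P}_L\Xi_i, \phi_k\otimes\phi_j\rangle|^2\Bigr)^{1/2}\Bigl(\sum_{k,j=1}^m\frac{|\langle M, \phi_k\otimes\phi_j\rangle|^2}{\lambda_k^{-1}\wedge\delta^2}\Bigr)^{1/2} \le \sqrt{2}\Bigl(\sum_{k,j=1}^m(\lambda_k^{-1}\wedge\delta^2)|\langle\mathcal{P}_L\Xi_i, \phi_k\otimes\phi_j\rangle|^2\Bigr)^{1/2}. \tag{4.32}$$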

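Consider the following inner product:

$$\langle M_1, M_2\rangle_w := \sum_{k,j=1}^m(\lambda_k^{-1}\wedge\delta^2)\langle M_1, \phi_k\otimes\phi_j\rangle\langle M_2, \phi_k\otimes\phi_j\rangle$$

and let $\|\cdot\|_w$ be the corresponding norm. We will provide an upper bound on

$$\mathbb{E}\|\mathcal{P}_L\Xi_i\|_w = \mathbb{E}\Bigl(\sum_{k,j=1}^m(\lambda_k^{-1}\wedge\delta^2)|\langle\mathcal{P}_L\Xi_i, \phi_k\otimes\phi_j\rangle|^2\Bigr)^{1/2}.$$

Recall that

$$\Xi_i = n^{-1}\sum_{j=1}^n\bigl(\zeta_jE_{X_j,X_j'} - \mathbb{E}\zeta E_{X,X'}\bigr),$$

where $\zeta_j = Y_j - S(X_j,X_j')$ for $i = 1$ and $\zeta_j = \varepsilon_j$ for $i = 2$. Note that in the first case $|\zeta_j| \le 2$ and in the second case $|\zeta_j| \le 1$. Therefore,

$$\mathbb{E}\|\mathcal{P}_L\Xi_i\|_w \le \mathbb{E}^{1/2}\|\mathcal{P}_L\Xi_i\|_w^2 \le \sqrt{\frac{\mathbb{E}\zeta^2\|\mathcal{P}_LE_{X,X'}\|_w^2}{n}} \le 2\sqrt{\frac{\mathbb{E}\|\mathcal{P}_LE_{X,X'}\|_w^2}{n}}. \tag{4.33}$$

It remains to bound $\mathbb{E}\|\mathcal{P}_LE_{X,X'}\|_w^2$.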

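$$\begin{aligned}
\mathbb{E}\|\mathcal{P}_L(E_{X,X'})\|_w^2
&= \sum_{k,j=1}^m(\lambda_k^{-1}\wedge\delta^2)\,\mathbb{E}|\langle\mathcal{P}_LE_{X,X'}, \phi_k\otimes\phi_j\rangle|^2
 = \sum_{k,j=1}^m(\lambda_k^{-1}\wedge\delta^2)\,\mathbb{E}|\langle E_{X,X'}, \mathcal{P}_L(\phi_k\otimes\phi_j)\rangle|^2 \\
&= \sum_{k,j=1}^m(\lambda_k^{-1}\wedge\delta^2)\,m^{-2}\sum_{u,v\in V}|\langle E_{u,v}, \mathcal{P}_L(\phi_k\otimes\phi_j)\rangle|^2 \\
&\le 2m^{-2}\sum_{k,j=1}^m(\lambda_k^{-1}\wedge\delta^2)\|P_L\phi_k\|^2 + 2m^{-2}\sum_{k,j=1}^m(\lambda_k^{-1}\wedge\delta^2)\|P_L\phi_j\|^2 \\
&= 2m^{-1}\sum_{k=1}^m(\lambda_k^{-1}\wedge\delta^2)\|P_L\phi_k\|^2 + 2m^{-2}\sum_{k=1}^m(\lambda_k^{-1}\wedge\delta^2)\sum_{j=1}^m\|P_L\phi_j\|^2 \\
&= 2m^{-1}\sum_{k=1}^m(\lambda_k^{-1}\wedge\delta^2)\|P_L\phi_k\|^2 + 2m^{-2}\sum_{k=1}^m(\lambda_k^{-1}\wedge\delta^2)\|P_L\|_2^2 \\
&= 2m^{-1}\sum_{k=1}^m(\lambda_k^{-1}\wedge\delta^2)\|P_L\phi_k\|^2 + 2m^{-2}r\sum_{k=1}^m(\lambda_k^{-1}\wedge\delta^2).
\end{aligned} \tag{4.34}$$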
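The inequality step in (4.34) can be justified along the following lines. This is a sketch under two assumptions that are not restated in this part of the text: that $\mathcal{P}_L(B) = B - P_{L^\perp}BP_{L^\perp}$, and that $\sum_{u,v\in V}|\langle E_{u,v}, B\rangle|^2 \le \|B\|_2^2$ for symmetric $B$.

```latex
% Decompose \phi_k = P_L\phi_k + P_{L^\perp}\phi_k to rewrite the projection:
\begin{align*}
\mathcal{P}_L(\phi_k\otimes\phi_j)
  &= \phi_k\otimes\phi_j - (P_{L^\perp}\phi_k)\otimes(P_{L^\perp}\phi_j) \\
  &= (P_L\phi_k)\otimes\phi_j + (P_{L^\perp}\phi_k)\otimes(P_L\phi_j), \\
\|\mathcal{P}_L(\phi_k\otimes\phi_j)\|_2^2
  &\le 2\|P_L\phi_k\|^2\|\phi_j\|^2 + 2\|P_{L^\perp}\phi_k\|^2\|P_L\phi_j\|^2
   \le 2\|P_L\phi_k\|^2 + 2\|P_L\phi_j\|^2 .
\end{align*}
% Summing over j gives a factor m in the first term, and
% \sum_{j=1}^m \|P_L\phi_j\|^2 = \|P_L\|_2^2 = r in the second.
```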

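Note that

$$\sum_{k=1}^m(\lambda_k^{-1}\wedge\delta^2)\|P_L\phi_k\|^2 \le \delta^2\sum_{\lambda_k \le \delta^{-2}}\|P_L\phi_k\|^2 + \sum_{\lambda_k > \delta^{-2}}\lambda_k^{-1}\|P_L\phi_k\|^2. \tag{4.35}$$

Using the first bound of Lemma 4, we get from (4.35) that

$$\sum_{k=1}^m(\lambda_k^{-1}\wedge\delta^2)\|P_L\phi_k\|^2 \le \delta^2\varphi(\delta^{-2}) + c_\gamma\delta^2\varphi(\delta^{-2}) = (c_\gamma + 1)\delta^2\varphi(\delta^{-2}). \tag{4.36}$$

We also have $\sum_{k=1}^m(\lambda_k^{-1}\wedge\delta^2) \le \sum_{\lambda_k \le \delta^{-2}}\delta^2 + \sum_{\lambda_k > \delta^{-2}}\lambda_k^{-1}$, which, by the second bound of Lemma 4, implies that

$$\sum_{k=1}^m(\lambda_k^{-1}\wedge\delta^2) \le \delta^2F(\delta^{-2}) + c_\gamma\delta^2F(\delta^{-2}) \le (c_\gamma + 1)\delta^2F(\delta^{-2}). \tag{4.37}$$

Using bounds (4.34), (4.36) and (4.37) and the fact that $\varphi(\lambda) \ge \frac{r}{m}F(\lambda)$, we get

$$\mathbb{E}\|\mathcal{P}_L(E_{X,X'})\|_w^2 \le 2m^{-1}(c_\gamma + 1)\delta^2\varphi(\delta^{-2}) + 2m^{-2}r(c_\gamma + 1)\delta^2F(\delta^{-2}) \le 4m^{-1}(c_\gamma + 1)\delta^2\varphi(\delta^{-2}). \tag{4.38}$$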

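The proof follows from (4.32), (4.33) and (4.38).

Let $\delta := \delta_1/\delta_3$. Using Lemma 6, we get

$$\begin{aligned}
\mathbb{E}\sup\bigl\{|\langle\mathcal{P}_L\Xi_i, A - S\rangle| : A \in \mathcal{T}(\delta_1,\delta_2,\delta_3)\bigr\}
&\le \mathbb{E}\sup\bigl\{|\langle\mathcal{P}_L\Xi_i, A - S\rangle| : \|A - S\|_{L_2(\Pi^2)} \le \delta_1,\ \|W^{1/2}(A - S)\|_{L_2(\Pi^2)} \le \delta_3\bigr\} \\
&= \mathbb{E}\sup\bigl\{|\langle\mathcal{P}_L\Xi_i, A - S\rangle| : \|A - S\|_2 \le \delta_1m,\ \|W^{1/2}(A - S)\|_2 \le \delta_3m\bigr\} \\
&\le \delta_3m\,\mathbb{E}\sup\bigl\{|\langle\mathcal{P}_L\Xi_i, A - S\rangle| : \|A - S\|_2 \le \delta,\ \|W^{1/2}(A - S)\|_2 \le 1\bigr\} \\
&\le 4\sqrt{2}\sqrt{c_\gamma + 1}\,\delta_1\sqrt{\frac{m\varphi(\delta^{-2})}{n}}.
\end{aligned}$$

In the case when $\delta^2 \ge \bar\varepsilon$, we get

$$\mathbb{E}\sup\bigl\{|\langle\mathcal{P}_L\Xi_i, A - S\rangle| : A \in \mathcal{T}(\delta_1,\delta_2,\delta_3)\bigr\} \le 4\sqrt{2}\sqrt{c_\gamma + 1}\,\delta_1\sqrt{\frac{m\varphi(\bar\varepsilon^{-1})}{n}}.$$

In the opposite case, when $\delta^2 < \bar\varepsilon$, we use the fact that the function $\lambda \mapsto \varphi(\lambda)/\lambda$ is nonincreasing. This implies that $\delta^2\varphi(\delta^{-2}) \le \bar\varepsilon\varphi(\bar\varepsilon^{-1})$ and we get

$$\mathbb{E}\sup\bigl\{|\langle\mathcal{P}_L\Xi_i, A - S\rangle| : A \in \mathcal{T}(\delta_1,\delta_2,\delta_3)\bigr\} \le 4\sqrt{2}\sqrt{c_\gamma + 1}\,\delta_1\sqrt{\frac{m\varphi(\delta^{-2})}{n}} = 4\sqrt{2}\sqrt{c_\gamma + 1}\,\delta_3\sqrt{\frac{m\delta^2\varphi(\delta^{-2})}{n}} \le 4\sqrt{2}\sqrt{c_\gamma + 1}\,\delta_3\sqrt{\bar\varepsilon}\sqrt{\frac{m\varphi(\bar\varepsilon^{-1})}{n}}.$$

We can conclude that

$$\mathbb{E}\sup\bigl\{|\langle\mathcal{P}_L\Xi_i, A - S\rangle| : A \in \mathcal{T}(\delta_1,\delta_2,\delta_3)\bigr\} \le 4\sqrt{2}\sqrt{c_\gamma + 1}\Bigl[\delta_1\sqrt{\frac{m\varphi(\bar\varepsilon^{-1})}{n}} \vee \delta_3\sqrt{\bar\varepsilon}\sqrt{\frac{m\varphi(\bar\varepsilon^{-1})}{n}}\Bigr].$$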

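This bound will be combined with (4.30) and (4.31) to get that, for $i = 1,2$,

$$\mathbb{E}\sup\bigl\{|\langle\Xi_i, A - S\rangle| : A \in \mathcal{T}(\delta_1,\delta_2,\delta_3)\bigr\} \le \varepsilon^*\delta_2 + 4\sqrt{2}\sqrt{c_\gamma + 1}\Bigl[\delta_1\sqrt{\frac{m\varphi(\bar\varepsilon^{-1})}{n}} \vee \delta_3\sqrt{\bar\varepsilon}\sqrt{\frac{m\varphi(\bar\varepsilon^{-1})}{n}}\Bigr].$$

In view of (4.29), this yields the bound

$$\mathbb{E}\alpha_n(\delta_1,\delta_2,\delta_3) \le C'\varepsilon^*\delta_2 + C'\delta_1\sqrt{\frac{m\varphi(\bar\varepsilon^{-1})}{n}} + C'\delta_3\sqrt{\bar\varepsilon}\sqrt{\frac{m\varphi(\bar\varepsilon^{-1})}{n}}$$

that holds with some constant $C' > 0$ for all $\delta_1,\delta_2,\delta_3 > 0$. Using (4.26), we conclude that for some constant $C > 0$ and for all $\delta_k \in [\delta_k^-,\delta_k^+]$, $k = 1,2,3$,

$$\alpha_n(\delta_1,\delta_2,\delta_3) \le C\Bigl[\delta_1\sqrt{\frac{m\varphi(\bar\varepsilon^{-1})}{n}} + \delta_1\sqrt{\frac{\bar t}{n}} + \frac{\bar t}{n} + \varepsilon^*\delta_2 + \delta_3\sqrt{\bar\varepsilon}\sqrt{\frac{m\varphi(\bar\varepsilon^{-1})}{n}}\Bigr]$$

that holds with probability at least $1 - e^{-t}$. This yields the following upper bound on the stochastic term in (4.19) (see also (4.25)):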

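$$\begin{aligned}
&2\langle\Xi, \hat S - S\rangle + 2(P - P_n)(S - S_*)(\hat S - S) + 2(P - P_n)(\hat S - S)^2 \\
&\quad\le 2C\Bigl[\|\hat S - S\|_{L_2(\Pi^2)}\sqrt{\frac{m\varphi(\bar\varepsilon^{-1})}{n}} + \|\hat S - S\|_{L_2(\Pi^2)}\sqrt{\frac{\bar t}{n}} + \frac{\bar t}{n} + \varepsilon^*\|\mathcal{P}_L^{\perp}\hat S\|_1 + \|W^{1/2}(\hat S - S)\|_{L_2(\Pi^2)}\sqrt{\bar\varepsilon}\sqrt{\frac{m\varphi(\bar\varepsilon^{-1})}{n}}\Bigr]
\end{aligned} \tag{4.39}$$

that holds provided that

$$\|\hat S - S\|_{L_2(\Pi^2)} \in [\delta_1^-,\delta_1^+],\quad \|\mathcal{P}_L^{\perp}\hat S\|_1 \in [\delta_2^-,\delta_2^+],\quad \|W^{1/2}(\hat S - S)\|_{L_2(\Pi^2)} \in [\delta_3^-,\delta_3^+]. \tag{4.40}$$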



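We substitute bound (4.39) in (4.19) and further bound some of its terms as follows:

$$2C\|\hat S - S\|_{L_2(\Pi^2)}\sqrt{\frac{m\varphi(\bar\varepsilon^{-1})}{n}} \le \frac{1}{8}\|\hat S - S\|_{L_2(\Pi^2)}^2 + 8C^2\frac{m\varphi(\bar\varepsilon^{-1})}{n},$$

$$2C\|\hat S - S\|_{L_2(\Pi^2)}\sqrt{\frac{\bar t}{n}} \le \frac{1}{8}\|\hat S - S\|_{L_2(\Pi^2)}^2 + 8C^2\frac{\bar t}{n},$$

and

$$2C\sqrt{\bar\varepsilon}\,\|W^{1/2}(\hat S - S)\|_{L_2(\Pi^2)}\sqrt{\frac{m\varphi(\bar\varepsilon^{-1})}{n}} \le \frac{\bar\varepsilon}{4}\|W^{1/2}(\hat S - S)\|_{L_2(\Pi^2)}^2 + 4C^2\frac{m\varphi(\bar\varepsilon^{-1})}{n}.$$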
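Each of these three bounds is an instance of the elementary inequality $2xy \le x^2/c + cy^2$. A short worked version, where the weights $c = 8$ and $c = 4/\bar\varepsilon$ are the values suggested by the constants $8C^2$ and $4C^2$ and should be read as an assumption about the intended splitting:

```latex
% For x, y \ge 0 and c > 0:  0 \le (x/\sqrt{c} - \sqrt{c}\,y)^2  gives
\[
2xy \le \frac{x^2}{c} + c\,y^2 .
\]
% With x = \|\hat S - S\|_{L_2(\Pi^2)},  y = C\sqrt{m\varphi(\bar\varepsilon^{-1})/n},  c = 8:
\[
2C\|\hat S - S\|_{L_2(\Pi^2)}\sqrt{\tfrac{m\varphi(\bar\varepsilon^{-1})}{n}}
  \le \tfrac18\|\hat S - S\|_{L_2(\Pi^2)}^2 + 8C^2\,\tfrac{m\varphi(\bar\varepsilon^{-1})}{n}.
\]
% With x = \|W^{1/2}(\hat S - S)\|_{L_2(\Pi^2)},
%      y = C\sqrt{\bar\varepsilon\, m\varphi(\bar\varepsilon^{-1})/n},  c = 4/\bar\varepsilon:
\[
2C\sqrt{\bar\varepsilon}\,\|W^{1/2}(\hat S - S)\|_{L_2(\Pi^2)}\sqrt{\tfrac{m\varphi(\bar\varepsilon^{-1})}{n}}
  \le \tfrac{\bar\varepsilon}{4}\|W^{1/2}(\hat S - S)\|_{L_2(\Pi^2)}^2
    + 4C^2\,\tfrac{m\varphi(\bar\varepsilon^{-1})}{n}.
\]
```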



We will also use (4.23) to control the term $\varepsilon|\langle\mathrm{sign}(S), \hat S - S\rangle|$ in (4.19) and (4.24) to control the term $\bar\varepsilon|\langle W^{1/2}S, W^{1/2}(\hat S - S)\rangle|$. If condition (4.4) holds with $D \ge 32C$, then $\varepsilon \ge 2C\varepsilon^*$. It follows from (4.19) by a simple algebra that



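$$\|\hat S - S_*\|_{L_2(\Pi^2)}^2 \le \|S - S_*\|_{L_2(\Pi^2)}^2 + C_1m^2\varepsilon^2\varphi(\bar\varepsilon^{-1}) + C_1\frac{m\varphi(\bar\varepsilon^{-1})}{n} + \bar\varepsilon\|W^{1/2}S\|_{L_2(\Pi^2)}^2 + C_1\frac{\bar t}{n}$$

with some constant $C_1 > 0$. Since, under condition (4.4) with $a = 1$, $m^2\varepsilon^2 \ge D^2\frac{m\log(2m)}{n}$, we can conclude that

$$\|\hat S - S_*\|_{L_2(\Pi^2)}^2 \le \|S - S_*\|_{L_2(\Pi^2)}^2 + C_2m^2\varepsilon^2\varphi(\bar\varepsilon^{-1}) + \bar\varepsilon\|W^{1/2}S\|_{L_2(\Pi^2)}^2 + C_2\frac{\bar t}{n} \tag{4.41}$$

with some constant $C_2 > 0$.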

We still have to choose the values of $\delta_k^-, \delta_k^+$ and to handle the case when conditions (4.40) do not hold. First note that due to the assumption that $\|S\|_{L_\infty} \le 1$, $S \in \mathbb{D}$, we have $\|\hat S - S\|_{L_2(\Pi^2)} \le 2$, $\|\mathcal{P}_L^{\perp}\hat S\|_1 \le \|\hat S\|_1 \le \sqrt{m}\|\hat S\|_2 \le m^{3/2}$ and $\|W^{1/2}(\hat S - S)\|_{L_2(\Pi^2)} \le 2\sqrt{\lambda_m}$. Thus, we can set $\delta_1^+ := 2$, $\delta_2^+ := m^{3/2}$, $\delta_3^+ := 2\sqrt{\lambda_m}$, which guarantees that the upper bounds of (4.40) are satisfied. We will also set $\delta_1^- = \delta_2^- := n^{-1/2}$, $\delta_3^- := \sqrt{\lambda_1/n}$. In the case when one of the lower bounds of (4.40) does not hold, we can still use inequality (4.39), but we have to replace each of the norms $\|\hat S - S\|_{L_2(\Pi^2)}$, $\|\mathcal{P}_L^{\perp}\hat S\|_1$, $\|W^{1/2}(\hat S - S)\|_{L_2(\Pi^2)}$ which are smaller than the corresponding $\delta_k^-$ by the quantity $\delta_k^-$. Then it is straightforward to check that inequality (4.41) still holds for some value of constant $C_2 > 0$. With the above choice of $\delta_k^-, \delta_k^+$, we have $\bar t \le t + 3\log\bigl(2\log_2n + \frac{1}{2}\log_2\frac{\lambda_m}{\lambda_1} + 2\bigr) = t_{n,m}$. This completes the proof.